Is data science really a science?

What a data scientist really has to be able to do

The question about the skills of a data scientist has been asked frequently and answered at least as often. Leading data experts and CIOs now largely agree on which tasks a data scientist should take on and which skills are required for them. I would like to try to capture this consensus in a graphic, a layer model: the Data Science Knowledge Stack.

Layer models are a popular way of explaining relationships in computer science. Anyone who has data science explained to them by a single person is very likely to learn about one very specific direction within the discipline. For example, a data scientist might offer you a seminar on data science for business analytics with Python, but not on medical analyses and not with R or Julia. Every data scientist conveys a very specific direction.

However, we can abstract the skills of all data scientists!

In every data science project, a data scientist has to cope with problems on different levels, for example data access does not work as planned or the data has a different structure than expected. A data scientist can spend hours debugging their own source code or familiarizing themselves with new data science packages for their chosen programming language.

The right algorithms for data evaluation also have to be selected, correctly parameterized and tested; sometimes it turns out that the selected methods were not the optimal ones. Ultimately, added value should be generated for the specialist area, and a data scientist faces particular challenges at this level as well. All of this applies regardless of the specialist area in which he applies his knowledge and tools.

Data Science Knowledge Stack

The data science knowledge stack provides a structured insight into the tasks and challenges a data scientist faces. The layers of the stack represent a flow that runs both from top to bottom and from bottom to top, because data science as a discipline is bidirectional: data scientists try to answer given questions with data, or explore what potential lies in the data to answer questions that have not yet been asked.

1st layer: Database Technology Knowledge

A data scientist mainly works with data, which is seldom available as a directly structured CSV file, but usually resides in one or more databases that follow their own rules. Business data in particular, for example from the ERP or CRM system, is held in relational databases, often from Microsoft, Oracle, SAP or an open source alternative. A good data scientist not only masters the Structured Query Language (SQL), but also understands the importance of relational relationships, i.e. knows the principle of normalization.
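
As a minimal sketch of what relational relationships and normalization mean in practice, the following Python example builds two hypothetical, normalized tables with SQLite and joins them with SQL; the table and column names are made up purely for illustration.

```python
import sqlite3

# Two hypothetical, normalized tables (customers, orders) linked by a
# foreign key -- the kind of relational structure a data scientist
# should be able to read and query.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'ACME'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 99.5), (11, 1, 25.0), (12, 2, 310.0);
""")

# Aggregate revenue per customer across the normalized schema
rows = con.execute("""
    SELECT c.name, SUM(o.amount) AS revenue
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.name
""").fetchall()
print(rows)  # [('ACME', 124.5), ('Globex', 310.0)]
```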

Other types of databases, so-called NoSQL databases (Not only SQL), are based on file formats, column orientation or graph orientation. Examples of common NoSQL databases are MongoDB, Cassandra or Neo4J.

A data scientist must therefore be able to cope with different database systems and at least have a very good command of SQL - the quasi-standard for data processing.

2nd layer: Data Access & Transformation Knowledge

If there is data in a database, data scientists can carry out simple analyses directly on the database. But how do we get the data into our special analysis tools? To do this, a data scientist needs to know how to export data from the database. An export as a CSV file can be sufficient for one-off actions, but even here parameters must be taken into account, for example suitable separators, encoding, text qualifiers or splits for particularly large data.
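
A hedged sketch of such an export with pandas; the data is invented, and the parameters shown (separator, encoding, text qualifier) are exactly the ones that typically cause trouble when they are left implicit.

```python
import pandas as pd

# Hypothetical one-off export: the parameters matter, not the data itself.
df = pd.DataFrame({"id": [1, 2], "comment": ['contains a "quote"', "and; a separator"]})

df.to_csv(
    "export.csv",
    sep=";",            # separator that does not collide with commas in the text
    encoding="utf-8",   # explicit encoding avoids surprises on other systems
    quotechar='"',      # text qualifier for fields that contain the separator
    index=False,
)

# Reading the file back requires the same parameters
df2 = pd.read_csv("export.csv", sep=";", encoding="utf-8", quotechar='"')
```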

With direct data connections, interfaces such as REST, ODBC or JDBC come into play. Some knowledge of socket connections and client-server architectures sometimes pays off. Furthermore, every data scientist should be familiar with symmetric and asymmetric encryption methods, because confidential data is used quite frequently and a minimum standard of security must always be observed, at least for business or medical applications.
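
As an illustrative sketch, a direct connection via ODBC might look like this in Python with pyodbc and pandas; the connection string, server and table are placeholders, not a recommendation for a specific setup.

```python
import pyodbc
import pandas as pd

# Hypothetical connection string -- driver, server, database and credentials
# are placeholders and depend on the actual environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=erp-db.example.com;DATABASE=erp;UID=analyst;PWD=...;"
    "Encrypt=yes;"   # transport encryption for confidential data
)

# Pull the result of a query straight into a DataFrame for analysis
orders = pd.read_sql("SELECT customer_id, amount FROM orders", conn)
conn.close()
```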

Much data is not structured in a database, but is so-called unstructured or semi-structured data from documents or from internet sources. Here, too, analysts are confronted with interfaces, for example those of social media channels. Sometimes data needs to be analyzed in near real time, as is often the case with machine or financial data. Data streaming is a discipline in its own right, but any data scientist can quickly come into contact with it.
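
A small, hypothetical example of pulling semi-structured JSON from a REST interface and flattening it into a table; the URL and the shape of the response are invented for illustration.

```python
import requests
import pandas as pd

# Hypothetical REST endpoint returning semi-structured JSON, e.g. posts
# from a social media channel; URL and fields are placeholders.
resp = requests.get("https://api.example.com/v1/posts",
                    params={"limit": 100}, timeout=10)
resp.raise_for_status()
posts = resp.json()  # assumed to be a list of nested dictionaries

# Flatten the nested JSON into a table so it can be analysed like structured data
df = pd.json_normalize(posts, sep="_")
print(df.head())
```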

3rd layer: Programming Language Knowledge

For data scientists, programming languages are tools for processing data and automating that processing. Data scientists are usually not real software developers and in fact they don't have to worry about software security or ergonomics.

However, a certain basic knowledge of software architectures can be helpful, because after all, some programs for statistical analysis or machine learning should be integrated into an IT landscape. What is indispensable, however, is an understanding of object-oriented programming and a good knowledge of the syntax of the programming language that was selected for the data science project, currently R or Python.

At the level of the programming language, there are already many pitfalls in the day-to-day work of a data scientist that are rooted in the programming language itself, because each language has its own quirks. Details determine whether an analysis runs correctly or incorrectly: for example, whether data objects are passed by copy or by reference, or how NULL values are treated.
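
Two of these pitfalls can be shown in a few lines of Python: assignment binds a reference rather than copying the object, and missing values (NaN) do not behave like ordinary values.

```python
import numpy as np
import pandas as pd

# Pitfall 1: assignment binds a reference, it does not copy the object.
a = [1, 2, 3]
b = a            # b points to the same list as a
b.append(4)
print(a)         # [1, 2, 3, 4] -- a changed too
c = a.copy()     # an explicit copy behaves as expected

# Pitfall 2: missing values do not behave like ordinary values.
x = np.nan
print(x == np.nan)   # False -- NaN is never equal to itself
print(pd.isna(x))    # True  -- the correct check for missing values
```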

4th layer: Data Science Tool & Library Knowledge

Once a data scientist has loaded his data into his favorite tool, for example one from IBM, SAS or an open source alternative such as Octave, his core work is only just beginning. These tools are not necessarily self-explanatory, which is why there is a wide range of certification options for various data science tools.

Many (if not most) data scientists prefer to work directly with a programming language, but this alone is not enough to carry out statistical data analyses or machine learning efficiently: we use data science libraries, i.e. packages that provide ready-made data structures and methods and thus extend the programming language, which in turn often creates new pitfalls.

Such a library, for example Scikit-Learn for Python, is a collection of methods implemented in the programming language and thus a data science tool. However, the use of such libraries has to be learned and therefore requires training and practical experience for reliable use.
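
A minimal sketch of the typical workflow with such a library, here Scikit-Learn with one of its bundled example datasets: fit a model on training data and evaluate it on data the model has not seen.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small example dataset and split it into training and test data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a classifier and check its accuracy on the held-out test data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```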

When it comes to big data analytics, i.e. the analysis of particularly large amounts of data, we enter the field of distributed computing. Tools (or frameworks) such as Apache Hadoop, Apache Spark or Apache Flink make it possible to process and evaluate data in parallel on several servers. These tools also come with their own libraries with their own characteristics, e.g. Mahout, MLlib and FlinkML for machine learning.
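
As a rough sketch, a distributed clustering job with Spark's MLlib (the pyspark.ml API) could look like this; the data and column names are made up, and a real cluster configuration is deliberately omitted.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Local session for illustration; on a cluster the same code runs distributed.
spark = SparkSession.builder.appName("knowledge-stack-demo").getOrCreate()

# Tiny, made-up dataset with two numeric columns
df = spark.createDataFrame(
    [(1.0, 1.1), (0.9, 1.0), (8.0, 8.2), (8.1, 7.9)],
    ["x", "y"],
)
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

# Cluster the data with k-means from MLlib
model = KMeans(k=2, seed=1).fit(features)
print(model.clusterCenters())

spark.stop()
```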

5th layer: Data Science Method Knowledge

A data scientist is not just an operator of tools, but uses the tools to apply his analytical methods to the data he has selected for the goals that were set. These analysis methods are, for example, evaluations of descriptive statistics, estimation methods or hypothesis tests. Machine learning methods for data mining, for example clustering or dimension reduction, or those aimed at automated decision-making through classification or regression, are somewhat more mathematical.
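
For example, a simple analysis of two hypothetical samples that combines descriptive statistics with a classical hypothesis test (a two-sample t-test, here with SciPy):

```python
import numpy as np
from scipy import stats

# Two made-up samples, e.g. a metric measured in two groups
rng = np.random.default_rng(0)
group_a = rng.normal(loc=100.0, scale=10.0, size=50)
group_b = rng.normal(loc=105.0, scale=10.0, size=50)

# Descriptive statistics
print("mean A:", group_a.mean(), "mean B:", group_b.mean())

# Hypothesis test: are the group means significantly different?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("p-value:", p_value)  # a small p-value suggests the means differ
```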

Machine learning processes usually do not work straight away; they have to be improved using optimization processes. A data scientist must be able to identify under- and over-fitting and must prove that the forecast results are accurate enough for the intended use.
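
A compact illustration of overfitting: the same model family, fitted once with a restrictive and once with an unrestricted parameterization, compared on training and test data (a sketch using scikit-learn's bundled diabetes dataset).

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# A large gap between training and test score is a typical sign of overfitting.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, None):  # shallow tree vs. unrestricted tree
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train R2={model.score(X_train, y_train):.2f}, "
          f"test R2={model.score(X_test, y_test):.2f}")
```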

Special applications require special knowledge of machine learning or deep learning; this applies, for example, to the subject areas of image recognition (visual computing) or the processing of human language (natural language processing).

6th layer: Technical Expertise

Data science is not an end in itself, but a discipline that aims to answer questions from other subject areas with data. This is why data science is so diverse. Business economists need data scientists to analyze financial transactions, customer behavior or supplier situations. Natural scientists such as geologists, biologists or experimental physicists also use data science to evaluate their observations with the aim of gaining knowledge. Engineers want to better understand the condition and interrelationships of machine systems or vehicles, and medical professionals are interested in better diagnostics and medication for their patients.

In order for a data scientist to be able to support a certain specialist area with his knowledge of data, tools and analysis methods in a result-oriented manner, he himself needs a minimum of the corresponding technical expertise. Anyone who wants to carry out analyses for businesspeople, engineers, natural scientists, doctors, lawyers or other interested parties must also be able to understand them on a professional level.

Narrower data science definition

While the data science pioneers have long established highly specialized teams, smaller companies, for example, are looking for the data science all-rounder who can take on the full range of tasks, from accessing the database to implementing the analytical application, even if that means compromising on the depth of specialist knowledge.

Companies with specialized data labs, i.e. data science departments as staff units or shared service centers, have long since distinguished their data experts relatively precisely into data scientists, data engineers and business analysts. The definition of data science and the delimitation of the skills that a data scientist should have therefore fluctuates between the broader and the narrower definition.

A closer look shows that a data engineer takes over data provisioning, while the data scientist loads the data into his tools and runs the data analysis together with colleagues from the specialist department. According to this view, a data scientist would need little or no knowledge of databases or APIs, and pronounced technical expertise would not actually be necessary.

What data science looks like in professional practice certainly depends very much on whether it is practiced in a company or in research. The range of tasks of a data science all-rounder encompasses more than just this core area. The mistake of limiting data science to this narrower view certainly also arises in data science courses and seminars, because there, understandably, the focus is on data science as a discipline: programming, tools and methods from mathematics and statistics.