Data Scientist vs Data Engineer: Differences and Why You Need Both
Was Nikola Tesla a scientist or engineer? How about Edison? Or Da Vinci? It’s hard to give a solid answer, right? These men didn’t stop at scientific research and ended up conceptualizing or engineering their inventions. One discipline goes hand in hand with the other.
In the modern world, this distinction is even more vague. Engineers don’t only wear hardhats and operate on construction sites. Scientists don’t always wear lab coats and play with test tubes. Explaining the difference, especially when they both work with something intangible such as data, is difficult. If you’re an executive who has a hard time grasping the underlying processes of data science and get confused with terminology, keep reading. We will try to answer your questions and explain how these two critical data jobs are different and where they overlap.
Data science vs data engineering
Data flows in every organization in huge amounts. It comes from different sources, it’s formatted in a variety of ways, and it’s all extremely valuable if we organize it in a way that serves us and then use it to solve our business problems. This whole process of making sense of data is under the data science umbrella.
Data science is a discipline that encompasses all knowledge, methods, and technologies that help us extract value from data. The term data science first started to take shape in the 1970s. Then, people who collected, processed, analyzed, and displayed insights from data were known as statisticians. But with the start of the 21st century, when data started to become big and create vast opportunities for business discoveries, statisticians were correctly renamed into data scientists.
These were people from academic, mathematical backgrounds, who, at the dawn of data science, managed to perform all the tasks of data processing themselves. As the complexity of tasks and the volume of data needed to process increased, data scientists started focusing more on helping businesses solve problems.
Data scientists today are business-oriented analysts who know how to shape data into answers, often building complex machine learning models. Operational engineering tasks are delegated to other, more focused specialists. Imagine a scientist who received a grant for their research and can now employ a whole team to support their vision. Here, data scientists are supported by data engineers.
Data engineering itself is a process of creating mechanisms for accessing data. The movement of data from its source to analytical tools for end users requires a whole infrastructure. And although this flow of data must be automated, building and maintaining it is a task of a data engineer.
Data engineers are programmers that create software solutions with big data. They’re integral specialists in data science projects and cooperate with data scientists by backing up their algorithms with solid data pipelines.
Juxtaposing data scientist vs engineer tasks
One data scientist usually needs two or three data engineers. Depending on the complexity of a data pipeline or the experience of your data engineers, there could be more of them. It’s a common conundrum, what you definitely don’t want to have is more scientists than engineers, because that would mean the former are doing the engineering work. Though it’s not uncommon for their competencies to overlap. To understand when and how they overlap, we first need to understand each of their positions better. Then we can discuss whether one person can successfully do both.
Responsibilities: Data Scientist vs Data Engineer
As organizations are looking to adopt data-driven decision-making, they face a problem of constructing a data science team. Regardless of the structure they eventually build, it’s usually composed of two types of specialists: builders, who use data in production, and analysts, who know how to make sense of data. The distinction between data scientists and engineers is similar. Let’s explore it.
Data scientist’s responsibilities — Datasets and Models
When a business seeks to extract value from data, the first person it needs to talk to is a data scientist. A high ranking professional, a data scientist has both technical and domain knowledge, helping them freely communicate with executives about their business goals and challenges. A data scientist takes part in almost all stages of a machine learning project by making important decisions and configuring the model.
Data preparation and cleaning. Final analytics are only as good and accurate as the data they use. The lengthy but crucial process of data preparation is typically run by a data scientist, who identifies what data is needed to achieve your business goals, where to get it from, and how to make it consistent with the desired outcome. We have a detailed video listing different processing stages data goes through and we urge you to watch it.
Feature engineering. Each data point has a lot of values, called features. For example, the features of a single transaction are its time, date, monetary value, location, and more. Depending on the use case, some features are more valuable than others, having more predictive power. Data scientists need to have precise domain knowledge to extract the most relevant features for a model to use. Feature engineering is one of the most important and time consuming parts of a data scientist’s job.
Choosing an algorithm. Machine learning algorithms are designed to solve specific problems, though other conditions factor in the choice: the dataset size, the training time that you have, number of features, etc. Data scientists are well versed in algorithms and data-related problems to enable them to make a solid choice.
Model training. A data scientist splits the dataset into three subsets — for training, testing, and validation. They then feed the training set to the algorithm that provides a target value, which is usually a sought-after prediction. Depending on different attributes and the situation, they choose the training type, usually supervised or unsupervised learning.
Model testing and validating. A data scientist also optimizes the model to make sure it delivers the best possible result. They use cross-validation to measure a model’s performance against different sets of parameters and define which set delivers the most accurate predictions.
Building data visualizations. To make data more understandable, it’s often visualized. Such visualizations as graphs and charts are typically prepared by data analysts or business analysts, though not every project has those people employed. Then, a data scientist uses complex business intelligence tools to present business insights to executives.
Data engineer’s responsibilities — Development and Architecture
A data engineer’s integral task is building and maintaining data infrastructure — the system managing the flow of data from its source to its destination. This typically includes setting up two processes: an ETL pipeline, which moves data, and a data storage (typically, a data warehouse), where it’s kept. Engineers can build different types of architectures by mixing and matching these parts.
Note: Depending on project scope, goals, and steps in data processing, sometimes, a separate person is employed to build an ETL pipeline — ETL developer. We recommend referring to our article to learn more about when this person is needed.
Developing data-related instruments. Data engineers are programmers first and data specialists next, so they use their coding skills to develop, integrate, and manage tools supporting the data infrastructure: data warehouse, databases, ETL tools, and analytical systems.
Deploying machine learning models. ML models are designed by data scientists, but data engineers deploy those into production. They set up resources required by the model, create pipelines to connect them with data, manage computer resources, and monitor and configure the model’s performance.
Managing data and metadata. There are different ways to store data: a data warehouse, numerous data lakes and data hubs, etc. Data engineers control how data is stored and structured within those locations.
Providing data access tools. Often, data scientists can source data directly from storage, for example, from data lakes. But when required, data engineers set up connections for nontechnical business intelligence users, so they can generate reports and insights from data.
Monitor the infrastructure. Although the pipeline must work automatically, its performance and stability has to be monitored and modified when needed.
Skills: Data Scientist vs Data Engineer
Data scientists and engineers have to be familiar with the same technologies, but to a different degree. What matters the most here is each individual’s background. That’s why people in both roles are constantly continuing their education to close the gaps in some knowledge needed for a new project or to distinguish themselves on the job market.
Data scientist’s skills: Stats and Algorithms
Data scientists traditionally come from an academic field. About 80 percent of specialists have a master’s degree and around 30 percent of them have a Ph.D. This is not a prerequisite for entering the job market. But in a growing number of data science education programs, many active data scientists studied…data science. Computer science and statistics graduates come in second.
Though not many of them were hired for the position right after graduation, they’re often continuing education in online courses. With that in mind, it’s not uncommon for a company to grow their own data scientists from adjacent competencies: Analysts, database experts, people with coding experience in Java or C/C++ are often trained in algorithms and models to become data scientists. Let’s give a rundown of the necessary skills and what they entail.
Statistics and maths. Although statistical models and mathematical calculations are successfully performed by software, data scientists need to be able to understand the system’s capabilities and interpret the results. Besides, they more often than not use algorithms based in statistics rather than sophisticated neural networks. Linear regression, classification, and ranking are also machine learning tasks and are common in operating real-world data.
Programming. Data scientists use different programming tools to extract data, build models, and create visualizations. They use Python, R and ML libraries such as scikit-learn and TensorFlow to train models. Expected to be somewhat conversant in data engineering, they are familiar with SQL, Hadoop, and Apache Spark.
Machine learning techniques. Not all data science employs machine learning, but it’s an important branch of data science as it helps solve problems based on predictions. A data scientist needs to be competent in some ML skills such as supervised and unsupervised machine learning, time series, natural language processing, computer vision, recommendation systems, reinforcement learning, and more.
Data visualization. While data scientists are often not expected to translate data into a format that business people can easily understand, the data visualization skill is extremely sought-after in the job market. So in order to increase their competitive advantage, data scientists learn to use such tools as Tableau, Power BI, ggplot2, matplotlib, and even Excel.
Data engineer’s skills: Coding and Big Data
In striking contrast with data scientists, data engineers don’t usually start their journey by specifically wanting to become data engineers. They are typically programmers or people with other IT degrees who became interested in automation and scripting tasks, acquiring knowledge of SQL database design along the way. This interest often leads them to data science bootcamps or data engineering courses where they get to gain experience for job application.
Depending on the project, data engineers can have a wider or narrower set of responsibilities, so their skillset may vary as well. Let’s go through the main areas.
An overview of data engineer skills
Programming. Data engineers are well-versed in Java, Scala, and C++, since these languages are often used in data architecture frameworks such as Hadoop, Apache Spark, and Kafka. Python, R, and Go are used for statistical analysis and modeling, so they’re also popular among data engineers.
ETL and BI skills. Building and maintaining the Extract, Transform, and Load (ETL) process as well as integrating it with the BI platform is a data engineer’s direct responsibility, so they must know data integration technologies such as Talend, Hadoop, Oracle, Informatica, and others. For machine learning projects, they must have experience with ML libraries such as TensorFlow.
Data warehousing. SQL is the main language for building databases so it’s widely used by data engineers. noSQL storages, cloud warehouses, and other data implementations are handled via tools such as Informatica, Redshift, and Talend.
As you can see, their skills and technology knowledge is different enough that a person would have to choose which path to pursue. Let’s explore what happens when companies try to tilt the balance by employing one person for the job.
Handling the disparity
Here’s what highly regarded data engineer and writer Jesse Anderson says about companies that don’t use the resources of their data specialists correctly.
When data scientists are building data pipelines. Data scientists can (and in some companies — do) create data pipelines, though it takes them much longer to pick the right tools for the job and fix mistakes that occur when they inevitably make the wrong choice. Since they’re not trained to create systems and put things into production, they will have to bridge not only the skill gap, but also the mindset gap. In organizations where they have to do data engineering, efficiency suffers and data scientists don’t stick for long, preferring to find a place where their abilities are put to better use.
When data engineers are doing data science. This is a much rarer situation, which happens when data science processes are fairly standardized and data engineers advanced in their math and statistics skills. Though at this point, they’re no longer called data engineers. They’re machine learning engineers.
A machine learning engineer is deploying ML models in production. This fairly new role is a natural combination of data scientist and data engineer skills that allows this person to put their programming and system building mindset into machine learning tasks.
How a machine learning engineer differs from a data scientist and a data engineer
Compared to data scientists who take a more research-focused approach, ML engineers are data engineers trained to take algorithms prepared by data scientists and organize them into a production workflow. When you need to develop a machine learning pipeline to make sure your AI applications reach the end user, you won’t succeed without this specialist in your team.
In smaller companies, responsibilities don’t have to be so strictly divided, though be careful not to fix existing operating holes with people who don’t have enough skill or experience to do so.