What Is a Data Engineer: Role Description, Responsibilities, Skills, and Background
With an incredible 2.5 quintillion bytes of data generated daily, data scientists are busier than ever. The more information we have, the more we can do with it. And data science provides us with methods to make use of this data. So, while you search for the definition of “quintillion,” Google is probably learning that you have this knowledge gap.
But understanding and interpreting data is just the final stage of a long journey, as the information goes from its raw format to polished analytical dashboards. Processing data systematically requires a dedicated ecosystem known as a data pipeline: a set of technologies that form a specific environment where data is obtained, stored, processed, and queried. So, along with data scientists who create algorithms, there are data engineers, the architects of data platforms.
In this article we’ll explain what a data engineer is, their scope of responsibilities, skill sets, and general role description. We’ll also describe how data engineers are different from other related roles.
What is a data engineer?
We’ll go from the big picture to details. Data engineering is a part of data science, a broad term that encompasses many fields of knowledge related to working with data. At its core, data science is all about getting data for analysis to produce meaningful and useful insights. The data can be further applied to provide value for machine learning, data stream analysis, business intelligence, or any other type of analytics.
While data science, and data scientists in particular, are concerned with exploring data, finding insights in it, and building machine learning algorithms, data engineering cares about making these algorithms work on a production infrastructure and creating data pipelines in general. So, a data engineer is an engineering role within a data science team or any data-related project that requires creating and managing the technological infrastructure of a data platform.
The role of data engineer
The role of a data engineer is as versatile as the project requires them to be. It will correlate with the overall complexity of a data platform. If you look at the Data Science Hierarchy of Needs, you can grasp a simple idea: The more advanced technologies like machine learning or artificial intelligence are involved, the more complex and resource-heavy data platforms become.
The growing complexity of data engineering compared to the oil industry infrastructure
Original picture: hackernoon.com
To give you an idea of what a data platform can be, and which tools are used to process data, let’s quickly outline some general architectural principles. A data infrastructure has three main functions.
- Extracting data: The information is located somewhere, so first we have to extract it. In terms of corporate data, the source can be some database, a website’s user interactions, an internal ERP/CRM system, etc. Or the source can be a sensor on an aircraft body. Or the data may come from public sources available online.
- Data storing/transition: Storage is the central architectural element of any data pipeline. We need to store extracted data somewhere. In data engineering, the concept of a data warehouse embodies the ultimate storage for all data gathered for analytical purposes.
- Transformation: Raw data may not make much sense to the end users, because it’s hard to analyze in such form. Transformations aim at cleaning, structuring, and formatting the data sets to make data consumable for processing or analysis. In this form, it can finally be taken for further processing or queried from the reporting layer.
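As a rough illustration of these three functions, here is a minimal Python sketch. The sample CSV text, the `orders` table, and the function names `extract`, `transform`, and `load` are all made up for this example, and an in-memory SQLite database stands in for a real warehouse. The sketch also transforms the data before storing it, which is only one possible ordering.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; a real source could be a CRM, a sensor, or a public API.
RAW_CSV = """order_id,customer,amount
1, alice ,19.90
2,BOB,5.00
3,alice,
"""

def extract(raw_text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: normalize customer names and drop rows without an amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        customer = row["customer"].strip().lower()
        cleaned.append((int(row["order_id"]), customer, float(amount)))
    return cleaned

def load(conn, records):
    """Store: write the cleaned records to a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for a warehouse
load(conn, transform(extract(RAW_CSV)))

# Query the stored data the way a reporting layer might.
totals = conn.execute(
    "SELECT customer, ROUND(SUM(amount), 2) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping each function separate means any stage can be replaced, for example by swapping the CSV reader for an API client, without touching the others.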
One of the various architectural approaches to data pipelines
Classical architecture of a data pipeline revolves around its central point, a warehouse. But, the presence of a unified storage isn’t obligatory, as analysts might use other instances for transformation/storage purposes. Or they can use no storage at all. So, the number of instances that are in between the sources and data access tools is what defines the data pipeline architecture.
The responsibilities of a data engineer can correspond to the whole system at once or each of its parts individually.
General-role. A data engineer found on a small team of data professionals would be responsible for every step of data flow. Everything from configuring data sources to integrating analytical tools would be architected, built, and managed by a general-role data engineer.
Warehouse-centric. Historically, the data engineer was responsible for using SQL databases to build data storage. This is still true today, but warehouses themselves have become much more diverse. So, there may be multiple data engineers, and some of them may solely focus on architecting a warehouse. The warehouse-centric data engineers may also cover different types of storage (NoSQL, SQL), tools to work with big data (Hadoop, Kafka), and integration tools to connect sources or other databases.
Pipeline-centric. Pipeline-centric data engineers would take care of data integration tools that connect sources to a data warehouse. These tools can either just load information from one place to another or carry out more specific tasks. For example, they may include data staging areas, where data arrives prior to transformation. Managing this layer of the ecosystem would be the focus of a pipeline-centric data engineer.
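To make the idea of a staging area more concrete, the following Python sketch lands raw records in a staging table and only then moves validated rows into a warehouse table. It is a simplified illustration: the table names `staging_events` and `warehouse_events`, the sample events, and the helper `promote_staged_events` are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_events (user_id TEXT, event TEXT, ts TEXT);
    CREATE TABLE warehouse_events (user_id INTEGER, event TEXT, ts TEXT);
""")

# Raw data arrives in staging exactly as the source sent it,
# including a duplicate and a malformed user ID.
raw_events = [
    ("42", "Login", "2024-01-01T10:00:00"),
    ("42", "LOGIN", "2024-01-01T10:00:00"),
    ("not-a-number", "click", "2024-01-01T10:05:00"),
    ("7", "click", "2024-01-01T10:06:00"),
]
conn.executemany("INSERT INTO staging_events VALUES (?, ?, ?)", raw_events)

def promote_staged_events(conn):
    """Move clean, de-duplicated rows from staging to the warehouse table."""
    staged = conn.execute(
        "SELECT DISTINCT user_id, LOWER(event), ts FROM staging_events"
    ).fetchall()
    # Keep only rows whose user ID is a plain number.
    valid = [(int(uid), event, ts) for uid, event, ts in staged if uid.isdigit()]
    conn.executemany("INSERT INTO warehouse_events VALUES (?, ?, ?)", valid)
    conn.execute("DELETE FROM staging_events")  # staging is emptied after each run
    conn.commit()
    return len(valid)

promoted = promote_staged_events(conn)
loaded = conn.execute(
    "SELECT user_id, event FROM warehouse_events ORDER BY user_id"
).fetchall()
print(promoted, loaded)
```

Because the staging table accepts everything as text, a bad record never blocks the load itself; it is filtered out in the promotion step instead.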
Data engineer responsibilities
Regardless of the focus on a specific part of a system, data engineers have similar responsibilities. This is mostly a technical position that combines knowledge and skills of computer science, engineering, and databases.
Architecture design. At its core, data engineering entails designing the architecture of a data platform.
Development of data-related instruments/instances. As a data engineer is a developer role in the first place, these specialists use programming skills to develop, customize, and manage integration tools, databases, warehouses, and analytical systems.
Data pipeline maintenance/testing. During the development phase, data engineers would test the reliability and performance of each part of a system. Or they can cooperate with the testing team.
Machine learning algorithm deployment. Machine learning models are designed by data scientists. Data engineers are responsible for deploying those into production environments. This entails providing the model with data stored in a warehouse or coming directly from sources, configuring data attributes, managing computing resources, setting up monitoring tools, etc.
Manage data and meta-data. The data can be stored in a warehouse either in a structured or unstructured way. Additional storage may contain meta-data (descriptive data about data). A data engineer is in charge of managing the stored data and structuring it properly via database management systems.
Provide data-access tools. In some cases, such tools are not required, as warehouse types like data-lakes can be used by data scientists to pull data right from storage. However, if an organization requires business intelligence for analysts and other non-technical users, data engineers are responsible for setting up tools to view data, generate reports, and create visuals.
Track pipeline stability. Monitoring the overall performance and stability of the system is important because the warehouse needs to be cleaned from time to time. The automated parts of a pipeline should also be monitored and modified, since data, models, and requirements can change.
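A stability check of this kind can be as simple as comparing a few pipeline statistics against thresholds. The Python sketch below is one hypothetical way to do it: the function name `check_pipeline_health`, its parameters, and the default thresholds are assumptions made for illustration, not part of any particular monitoring tool.

```python
from datetime import datetime, timedelta

def check_pipeline_health(row_count, last_loaded_at, now,
                          min_rows=1, max_age=timedelta(hours=24)):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if row_count < min_rows:
        problems.append(f"row count {row_count} is below the minimum of {min_rows}")
    if now - last_loaded_at > max_age:
        problems.append(f"last load at {last_loaded_at.isoformat()} is older than {max_age}")
    return problems

# Example run with fixed timestamps so the result is reproducible.
now = datetime(2024, 1, 2, 12, 0)
healthy = check_pipeline_health(1000, datetime(2024, 1, 2, 6, 0), now)
stale = check_pipeline_health(0, datetime(2023, 12, 30, 6, 0), now)

print(healthy)
for problem in stale:
    print("ALERT:", problem)
```

In a real platform, the row count and load time would come from the warehouse itself, and the alerts would go to whatever notification channel the team uses.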
Data engineer skills
Skills for any specialist correlate with the responsibilities they’re in charge of. The skill set would vary, as there is a wide range of things data engineers could do. But generally, their activities can be sorted into three main areas: engineering, data science, and databases/warehouses.
Skill set of a data engineer broken by domain areas
Engineering skills. Most tools and systems for data analysis/big data are written in Java (Hadoop, Apache Hive) and Scala (Apache Spark; Kafka combines Scala and Java). Python and R are widely used in data projects due to their popularity and syntactical clarity. High-performance languages like C/C# and Go are also popular among data engineers, especially for training and implementing ML models.
- Software architecture background
Data-related expertise. Data engineers would work closely with data scientists. A strong understanding of data modeling, algorithms, and data transformation techniques is the basis for working with data platforms. Data engineers will be in charge of building ETL (data extraction, transformation, and loading) processes, storage, and analytical tools. So, experience with the existing ETL and BI solutions is a must.
More specific expertise is required to take part in big data projects that utilize dedicated instruments like Kafka or Hadoop. If the project is connected with machine learning and artificial intelligence, data engineers must have experience with ML libraries and frameworks (TensorFlow, Spark, PyTorch, mlpack).
- Strong understanding of data science concepts
- Expertise in data analysis
- Hands-on experience with ETL tools
- BI tools knowledge
- Big data technologies: Hadoop and Kafka
- ML frameworks and libraries: TensorFlow, Spark, PyTorch, mlpack
Database/warehouse. In most cases, data engineers use specific tools to design and build data storage. This storage can hold structured/unstructured data for analysis or plug into a dedicated analytical interface. Usually, these are relational databases, so SQL is the main thing every data engineer should know for DB/queries. Other instruments like Talend, Informatica, or Redshift are popular solutions to create large distributed data storage (NoSQL), build cloud warehouses, or load data into managed data platforms.
As we already mentioned, the level of responsibility would vary depending on team size, project complexity, platform size, and the seniority level of an engineer. In some organizations, the roles related to data science and engineering may be much more granular and detailed. Let’s have a look at the key ones and try to define the differences between them.
Data specialists compared: data scientist vs data engineer vs ETL developer vs BI developer
Data scientists are usually employed to deal with all types of data platforms across various organizations. Data engineers, ETL developers, and BI developers are more specific jobs that appear when data platforms gain complexity. And the more complex a data platform is, the more granular the distribution of roles becomes. For instance, the organizations in the early stages of their data initiative may have a single data scientist who takes charge of data exploration, modeling, and infrastructure. As the complexity grows, you may need dedicated specialists for each part of the data flow.
Data scientists are the basis for most data-related projects. These are the specialists who know the what, why, and how of your data questions. They would provide the whole team with an understanding of what data types to use, what data transformations must happen, and how the data will be applied in the future. The input provided by data scientists lays the basis for the future data platform. Put simply, a data scientist would take on the following tasks:
- Define required data types
- Find data sources/mine data
- Define data gathering techniques
- Clean/prepare data sets
- Manage meta-data
- Set standards for data transformation/processing
- Develop machine learning models
- Define processes for monitoring and analysis
A data engineer is a technical person who’s in charge of architecting, building, testing, and maintaining the data platform as a whole. Depending on the project, they can focus on a specific part of the system or be an architect making strategic decisions. In the case of a small team, engineers and scientists are often the same people. But as a separate role, data engineers implement infrastructure for data processing, analysis, monitoring applied models, and fine-tuning algorithm calculations.
An ETL developer is a specific engineering role within a data platform that mainly focuses on building and managing tools for Extract, Transform, and Load stages. So, the border between a data engineer and an ETL developer is kind of blurred. However, an ETL developer is a narrower specialist who rarely takes architect/tech lead roles. These tasks typically go to an ETL developer:
- ETL process management
- Data warehouse architecting
- Data pipeline (ETL tools) development
- ETL testing
- Data flow monitoring
A business intelligence developer is a specific engineering role that exists within a business intelligence project. Business intelligence (BI) is a subcategory of data science that focuses on applying data analytics to historical data for business use. While a data engineer and ETL developer work with the inner infrastructure, a BI developer is in charge of
- defining reporting standards,
- developing reporting tools and data access tools,
- constructing interactive dashboards,
- developing data visualization tools,
- implementing OLAP cubes,
- testing warehouse architecture,
- validating data,
- testing user interface, and
- testing data querying process.
So, theoretically the roles are clearly distinguishable. In practice, the responsibilities can be mixed: Each organization defines the role for the specialist on its own. Everything depends on the project requirements, the goals, and the data science/platform team structure. The bigger the project and the more team members there are, the clearer the division of responsibilities will be. And vice versa, smaller data platforms require specialists performing more general tasks.
When to hire a data engineer?
There are several scenarios when you might need a data engineer.
Scaling your data science team. Here’s a general recommendation: When your team of data specialists reaches the point where there is nobody to maintain the technical infrastructure, a data engineer might be a good choice as a general specialist.
Big data projects. Currently, data engineering is shifting toward projects that aim at processing big data, managing data lakes, and building expansive data integration pipelines for NoSQL storage. In this case, a dedicated team of data engineers with roles allocated by infrastructure component is optimal.
Requiring custom data flows. Even for medium-sized corporate platforms, there may be the need for custom data engineering. Extract, Transform, Load is just one of the main principles applied mostly to automated BI platforms. In practice, a company might leverage different types of storages and processes for multiple data types. This involves a large technological infrastructure that can be architected and managed only by a diverse data specialist. A data engineer in this case is much more suitable than any other role in the data domain.