Dedicated practitioners of data engineering and data science are today’s gold miners who find new methods to collect, process, and store data.
Data engineering, explained
Using specific tools and practices, businesses implement these methods to generate valuable insights. One of the most common ways enterprises leverage data is business intelligence (BI), a set of practices and technologies that transform raw data into actionable information.
Data can be used for various purposes, such as analytics or building machine learning models, but it can’t be used in its raw format. Any system dealing with data processing requires moving information from storage and transforming it along the way so that people or machines can use it. This process is known as Extract, Transform, Load, or ETL, and it is usually carried out by a specific type of engineer: an ETL developer.
In this article, we will discuss the role of an ETL developer in a data engineering team. We will cover their main responsibilities and skillset while clearing up common misconceptions about ETL developers and related roles.
Who is an ETL developer?
An ETL developer is a type of software engineer who manages the Extract, Transform, and Load processes and implements the technical solutions behind them. The process is broken down into three main stages:
Extract. Businesses store historical information or stream real-time data into many systems. This information is scattered across different software and is structured in various formats. The extraction phase entails defining required data sources, whether it is an ERP, CRM, or third-party system, and gathering data from them.
Transform. Once the data is gathered from its sources, it’s usually placed in temporary storage called a staging area. While there, the data is formatted in accordance with defined standards and models. For example, monetary values recorded in different formats ($34.50, 0.90 cents, 01,65) are converted into a single coherent format: $34.50, $0.90, $1.65.
Load. The final stage of an ETL process is loading the structured and formatted data into a database. If the amount of data is small, any kind of database can be used. A specific type of database used in BI, big data processing, and machine learning is called a Data Warehouse.
A warehouse differs from a regular database in its structure: it may include several tools that represent data along multiple dimensions and make it accessible to each user. Data representation tools are connected to the warehouse so that users can pull the data out and manipulate it. These representation tools are the actual BI tools that offer analytical data through interactive dashboards and reports.
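The formatting example from the Transform stage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `normalize_amount` is hypothetical, "0.90 cents" is read as $0.90 (as in the example), a comma is treated as a decimal separator, and thousands separators are not handled.

```python
import re
from decimal import Decimal


def normalize_amount(raw: str) -> str:
    """Convert a loosely formatted money string into the form $D.CC."""
    text = raw.strip().lower()
    # Drop the currency symbol and the word "cents"
    # (assumption: "0.90 cents" means $0.90, as in the article's example).
    text = text.replace("$", "").replace("cents", "").strip()
    # Assumption: a comma is a decimal separator, never a thousands separator.
    text = text.replace(",", ".")
    if not re.fullmatch(r"\d+(\.\d+)?", text):
        raise ValueError(f"unrecognized amount: {raw!r}")
    # Round to two decimal places and restore the currency symbol.
    amount = Decimal(text).quantize(Decimal("0.01"))
    return f"${amount}"


if __name__ == "__main__":
    for value in ["$34.50", "0.90 cents", "01,65"]:
        print(value, "->", normalize_amount(value))
```

In a real pipeline, rules like these would come from the agreed data model rather than being hard-coded, and unrecognized values would typically be routed to a reject table instead of raising an error.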
Data processing in a nutshell and ETL steps outline
The data travels through numerous technical forms before reaching its final destination, where the user can access it. To move data, we need to build a pipeline, and that is exactly the job of an ETL developer.
Usually, an ETL developer is part of a data engineering team, the group in charge of data extraction, processing, and storage, as well as maintaining the corresponding infrastructure. The main task of the data engineering team is to obtain raw data, decide how it should look to be consumable, make it consumable, and then store it somewhere.
The roster of a team depends on the scope of the project, goals, steps of data processing, and required technologies. Thus, the data engineering team may include the following roles:
- Data architect. This role can be part of a data science or data engineering team. The data architect designs the infrastructure that data engineers will develop.
- Data engineer. This is a specific type of software engineer who develops the interfaces and the ecosystem that provide access to the information.
- Data analyst. This team member defines data collection methods, data models, and data types, and outlines the transformation process.
- Database/Warehouse developer. Data, like any other information, has to be stored somewhere. It can be a regular SQL database or a special type of storage, a data warehouse. The DB/warehouse developer is responsible for the modeling, development, and maintenance of data storage.
- DBA or Database Administrator. This is the person in charge of database management when there are multiple databases or when the structure of a DB/warehouse is highly complex.
- Data scientist. Projects dealing with machine learning also include a data science specialist or even a dedicated department.
- Business intelligence developer. This is a software engineer who focuses on developing BI interfaces.
- ETL developer. This is a software engineer who covers the Extract, Transform, and Load stages of data processing by developing and managing the corresponding infrastructure.
How different roles in data science teams work
Focusing on BI projects as the main business use case of data processing, let’s elaborate on the duties of an ETL developer.
What are the responsibilities of an ETL developer?
In a BI project, an ETL developer is one of the major engineering roles. While the main responsibility is to take charge of the Extract, Transform, Load stage, an ETL developer also performs tasks connected with data analytics, testing, and system architecture. We will briefly cover each of the possible tasks:
- ETL process management
- Data modeling
- Data warehouse architecture
- Data pipeline (ETL tools) development
- ETL testing
ETL process management
ETL is one of the key stages in data processing, and it involves both methodologies and technologies. The main task of an ETL developer, or a dedicated team of developers, is to:
- Outline the ETL process, setting the borders of data processing
- Provide system architecture for each element and the whole data pipeline
- Document the requirements of the system and manage its development
- Take part in the actual development/implementation of ETL tools
- Conduct testing of the tools and data pipelines
We contacted Wayne Yaddow, a Data Quality Trainer and BI consultant, to elaborate on ETL developer duties. Wayne suggests numerous points to consider: “An ETL developer is in charge of analyzing, interpreting data models, and resolving data issues in collaboration with data analysts. All system designs and publication activities pass the validation of an ETL developer. Consider also requirement validation, conducting user interviews with system users, developers, and monitoring of business requirements as for the ETL process.”
Data Modeling
Before the data is extracted from its sources, an ETL developer should define which formats are required. The end formats of data that will be represented in the warehouse (and user interface) are called data models.
Logical data model example
Source: 1keydata.com
Data models are constructed and documented in collaboration with business analysts, data analysts, and data scientists. An ETL developer then uses the models to define the transformation stage and the underlying technologies that will perform the formatting.
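To make the idea of a data model more concrete, here is a minimal sketch in Python. The entities (`Customer`, `Order`), their fields, and the raw record keys are all hypothetical and not taken from the diagram above. The point is only that a model fixes the target types, and the transformation maps raw source records onto them.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str


@dataclass(frozen=True)
class Order:
    order_id: int
    customer_id: int  # refers to Customer.customer_id
    order_date: date
    total: Decimal


def to_order(raw: dict) -> Order:
    """Map a raw source record (all values are strings) onto the Order model."""
    return Order(
        order_id=int(raw["id"]),
        customer_id=int(raw["customer"]),
        order_date=date.fromisoformat(raw["date"]),
        total=Decimal(raw["total"]),
    )


if __name__ == "__main__":
    record = {"id": "7", "customer": "42", "date": "2024-01-31", "total": "19.99"}
    print(to_order(record))
```

A record that cannot be converted (for example, a malformed date) raises an exception here, which is one simple way to surface data issues early in the transformation stage.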
Data Warehouse Architecture
A warehouse is a large storage facility used to save structured data. It’s often decomposed into smaller elements like data marts. Data marts give dedicated departments access to the data they require, with specific properties. For example, if a warehouse is a large storage area with all the information gathered, data marts are smaller databases that store thematic data (accounting, website metrics, etc.).
The warehouse itself or data marts are connected to the end-user interface, helping users access the information, manipulate it, make queries, and form reports. Additionally, the data can be enriched with metadata during the formatting stage, which also involves changes in the overall warehouse architecture.
Warehousing architecture and data representation tools
An ETL developer is responsible for defining the data warehouse architecture as well as the tools that load data into it. Warehousing is a complex process, and its development is usually carried out by a dedicated database developer. However, an ETL developer can possess all the required skills and knowledge to build it.
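The relationship between a warehouse and its data marts can be illustrated with Python's built-in `sqlite3` module. This is a simplified sketch, not a production design: the table, the view names, and the sample values are made up, and the marts are modeled as views over one table, whereas real data marts are often separate databases.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create an in-memory 'warehouse' with two thematic data marts."""
    conn = sqlite3.connect(":memory:")
    # The warehouse: one table holding everything that was gathered.
    conn.execute("CREATE TABLE facts (department TEXT, metric TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO facts VALUES (?, ?, ?)",
        [
            ("accounting", "invoices_paid", 120.0),  # illustrative values
            ("accounting", "invoices_open", 15.0),
            ("web", "page_views", 9800.0),
        ],
    )
    # Data marts: each department sees only its own slice of the data.
    conn.execute(
        "CREATE VIEW accounting_mart AS "
        "SELECT metric, value FROM facts WHERE department = 'accounting'"
    )
    conn.execute(
        "CREATE VIEW web_mart AS "
        "SELECT metric, value FROM facts WHERE department = 'web'"
    )
    return conn


if __name__ == "__main__":
    warehouse = build_warehouse()
    print(warehouse.execute("SELECT * FROM accounting_mart").fetchall())
```

A BI tool connected to `accounting_mart` would then query only accounting data, without needing to know how the rest of the warehouse is organized.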
Data Pipeline Development
Once each system component has been designed on its own, the final stage is the development of the data pipeline. A data pipeline is a technical infrastructure that automatically performs the following actions as a single system:
Data extraction from the given sources. Because the information is stored in various systems, an ETL tool should be integrated with each of them.
Data uploading into a staging area. The staging area is where the formatting happens. In some cases, it can be done in the warehouse, but most often a separate database is used to speed up the process and keep the warehouse clean.
Data formatting. When data is transferred to the staging area, it is formatted to meet the defined standards. This may include such operations as:
- Data cleansing, the process of deleting useless data fields
- Data structuring/mapping, the process of defining data types and the connections between them
- Adding metadata to enrich the information with details
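The three pipeline actions above (extraction, upload to a staging area, and formatting) can be sketched end to end with the standard library. Everything here is an assumption made for illustration: the two "source systems" are plain Python lists, the staging area and the warehouse are tables in the same SQLite database, and the `fax` field stands in for a useless data field.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical source systems with slightly messy records.
CRM_ROWS = [{"id": "1", "name": " Ada ", "fax": ""}, {"id": "2", "name": "Grace", "fax": ""}]
ERP_ROWS = [{"id": "3", "name": "Linus", "fax": "n/a"}]


def extract():
    """Gather rows from every source, tagging each row with its origin."""
    for source, rows in (("crm", CRM_ROWS), ("erp", ERP_ROWS)):
        for row in rows:
            yield source, row


def run_pipeline(conn: sqlite3.Connection):
    conn.execute("CREATE TABLE staging (source TEXT, id TEXT, name TEXT, fax TEXT)")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, source TEXT, loaded_at TEXT)"
    )
    # Upload the raw rows into the staging area unchanged.
    conn.executemany(
        "INSERT INTO staging VALUES (?, ?, ?, ?)",
        [(source, row["id"], row["name"], row["fax"]) for source, row in extract()],
    )
    loaded_at = datetime.now(timezone.utc).isoformat()
    # Formatting: cleansing drops the fax field, mapping casts id to an integer
    # and trims the name, and metadata (source, load time) is added.
    staged = conn.execute("SELECT source, id, name FROM staging").fetchall()
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?, ?)",
        [(int(row_id), name.strip(), source, loaded_at) for source, row_id, name in staged],
    )
    conn.commit()
    return conn.execute("SELECT id, name, source FROM customers ORDER BY id").fetchall()


if __name__ == "__main__":
    print(run_pipeline(sqlite3.connect(":memory:")))
```

Keeping the raw rows in a separate staging table means a failed formatting step can be rerun without going back to the source systems, which is one reason a dedicated staging area is commonly used.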
ETL testing
In the course of development, an ETL developer is in charge of testing the system, units, data models, and warehouse architecture. Besides usual QA activities, ETL testing has the following aspects to be checked:
- Data model testing
- Data warehouse architecture testing
- Representation tools check
- Data flow validation
- Uploading/downloading/querying speed testing
- System performance tests
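Two of the checks listed above, data flow validation and query speed testing, can be sketched as follows. This is a minimal illustration, assuming that source and target are tables reachable through one SQLite connection. The function names are hypothetical, and the table and column names are interpolated directly into SQL, so they must come from trusted configuration rather than user input.

```python
import sqlite3
import time


def validate_flow(conn: sqlite3.Connection, source: str, target: str, key: str):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # Check 1: every source row should have arrived in the target.
    src_rows = conn.execute(f"SELECT COUNT(*) FROM {source}").fetchone()[0]
    tgt_rows = conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]
    if src_rows != tgt_rows:
        problems.append(
            f"row count mismatch: {src_rows} in {source}, {tgt_rows} in {target}"
        )
    # Check 2: the key column must be populated in every target row.
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {target} WHERE {key} IS NULL"
    ).fetchone()[0]
    if nulls:
        problems.append(f"{nulls} row(s) in {target} have no {key}")
    return problems


def time_query(conn: sqlite3.Connection, sql: str) -> float:
    """Run a query, fetch all rows, and return the elapsed time in seconds."""
    start = time.perf_counter()
    conn.execute(sql).fetchall()
    return time.perf_counter() - start
```

Checks like these are typically run after every load, with the timing result compared against an agreed threshold to catch performance regressions.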
To perform all the above-mentioned duties, an ETL developer should possess a rich technical background. So now we are moving to the actual skillset an ETL developer has.
The skillset of an ETL developer
An ETL developer is a discipline-specific role that requires expertise in several fields. An ETL developer has a software engineering background and experience in database development. That is the general makeup of an ETL developer, but let’s narrow this down to specific skills.
The list of ETL developer’s skills required to execute corresponding responsibilities
Experience with ETL tools. As data engineering is a mature industry, there are a lot of ready-made solutions on the market. Instruments like Talend, Informatica, and Pentaho are considered industry standards, so experience with these tools is to an ETL developer what Photoshop is to a designer.
ETL tools are the out-of-the-box solutions that can perform Extract, Transform, Load steps right from the start. An ETL developer, in this case, is the administrator of the data integration tool connected to data sources and the warehouse. The task is to integrate existing instruments with the ETL tool, manage operations, and implement an interface to make data usable.
Database/DBA/Architect background. An ETL developer must be a specialist in database engineering. To understand data storage requirements and design warehouse architecture, an ETL developer should have expertise in SQL/NoSQL databases and data mapping. There are also instruments like Hadoop, which is both a framework and a platform used in ETL as a data integration tool.
Data analysis expertise. Since an ETL developer takes part in data modeling, mapping, and formatting, data analysis expertise is required.
Knowledge of scripting languages. If you deal with large datasets and complex pipelines, you’ll need some automation. ETL developers may use scripts to automate small parts of the process. The most popular scripting languages for ETL are Bash, Python, and Perl.
Troubleshooting. Data processing systems operate with large amounts of data and include multiple structural elements. An ETL developer is responsible for the proper functioning of the system, which requires strong analytical thinking and troubleshooting skills.
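As an example of the kind of small automation mentioned under scripting languages, the Python sketch below merges several CSV exports into one file and records which file each row came from. The function name and the `source_file` column are hypothetical. The script assumes that every export shares the same header and that the output file is written outside the input directory.

```python
import csv
from pathlib import Path


def merge_csv_exports(input_dir, output_file) -> int:
    """Concatenate every *.csv in input_dir into output_file.

    Adds a source_file column and returns the number of data rows written.
    Assumes all files share the same header.
    """
    rows_written = 0
    with open(output_file, "w", newline="", encoding="utf-8") as out:
        writer = None
        for path in sorted(Path(input_dir).glob("*.csv")):
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src)
                for row in reader:
                    if writer is None:
                        # Build the output header from the first file read.
                        fields = [*reader.fieldnames, "source_file"]
                        writer = csv.DictWriter(out, fieldnames=fields)
                        writer.writeheader()
                    row["source_file"] = path.name
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A script like this could be scheduled to run before the extraction step so that the ETL tool only has to read a single, consistently structured file.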
When to hire an ETL developer?
The ETL developer role overlaps with other data engineering roles because of its technical background and field of responsibilities. The main recommendation is to hire an ETL developer when you’re building a large-scale data processing system and the data flow is complex. In such a case, an ETL developer would be irreplaceable.
Warehouse developer instead of ETL developer: If the focus of your system falls on the operations inside the data warehouse, a warehouse developer is a good alternative to an ETL developer. A dedicated database/warehouse developer with strong expertise in BI projects can carry out data flow implementation, as well as use data integration tools.
Business intelligence developer instead of ETL developer: BI developers are software engineers that specialize in implementing BI interfaces that rely on data pipelines. If your project focuses on the ready-made solutions, and integrations with providers of business intelligence tools, a BI developer will be up to the task of maintaining the system.