Data Engineering and Its Main Concepts: Explaining the Data Pipeline, Data Warehouse, and Data Engineer Role
Sharing top billing on the list of data science capabilities, machine learning and artificial intelligence are not just buzzwords – many organizations are eager to adopt them. However, the often forgotten fundamental work necessary to make it happen – data literacy, collection, and infrastructure – must be accomplished prior to building intelligent data products.
If we look at the hierarchy of needs in data science implementations suggested by Monica Rogati, we’ll see that the next step after gathering your data for analysis is data engineering. This discipline is not to be underestimated, as it enables effective data storing and reliable data flow while taking charge of the infrastructure.
Data science layers towards AI, Source: Monica Rogati
In this article, we’re going to elaborate on the details of the data flow process, explain the nuances of building a data warehouse, and describe the role of a data engineer.
What is data engineering?
Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. It takes dedicated specialists – data engineers – to maintain data so that it remains available and usable by others. In short, data engineers set up and operate the organization’s data infrastructure preparing it for further analysis by data analysts and scientists.
You may also watch our video explainer on data engineering:
Data engineering 101
To understand data engineering in simple terms, let’s start with data sources. Within a large organization, there are usually many different types of operations management software (e.g., ERP, CRM, production systems, etc.), all of which contain different databases with varied information. Besides, data can be stored as separate files or even pulled from external sources in real time (such as various IoT devices). As the number of data sources multiplies, having data scattered all over in various formats prevents the organization from seeing the full and clear picture of their business state.
For example, it’s necessary to make sales data from its dedicated database “talk” to inventory records kept in a SQL server. This task requires extracting data out of those systems and integrating it in a unified storage where it’s collected, reformatted, and kept ready for use. Such storages are called data warehouses. Now, end-users (which include employees from different departments, managers, data scientists, business intelligence (BI) engineers, etc.) can connect to the warehouse, access the needed data in the convenient format, and start getting valuable insights from it.
The process of moving data from one system to another, be it a SaaS application, a data warehouse (DW), or just another database, is maintained by data engineers (read on to learn more about the role and skillset of this specialist).
Now, let’s look deeper into the main concepts of the data engineering domain, step by step.
ETL data pipeline
A Data pipeline is basically a set of tools and processes for moving data from one system to another for storage and further handling. It captures datasets from multiple sources and inserts them into some form of database, another tool, or app, providing quick and reliable access to this combined data for the teams of data scientists, BI engineers, data analysts, etc.
Check our video on how data science teams work
Constructing data pipelines is the core responsibility of data engineering. It requires advanced programming skills to design a program for continuous and automated data exchange. A data pipeline is commonly used for
- moving data to the cloud or to a data warehouse,
- wrangling the data into a single location for convenience in machine learning projects,
- integrating data from various connected devices and systems in IoT,
- copying databases into a cloud data warehouse, and
- bringing data to one place in BI for informed business decisions.
You can read our detailed explanatory post to learn more about data pipelines, their components, and types. Now, let’s explore what ETL stands for.
What is ETL?
Pipeline infrastructure varies depending on the use case and scale. However, data engineering usually starts with ETL operations:
1. Extracting data from source databases,
2. Transforming data to match a unified format for specific business purposes, and
3. Loading reformatted data to the storage (mainly, data warehouses).
Let’s describe these stages in more detail.
1. Extract — retrieving incoming data. At the start of the pipeline, we’re dealing with raw data from numerous separate sources. Data engineers write pieces of code – jobs – that run on a schedule extracting all the data gathered during a certain period.
2. Transform — standardizing data. Data from disparate sources is often inconsistent. So, for efficient querying and analysis, it must be modified. Having data extracted, engineers execute another set of jobs that transforms it to meet the format requirements (e.g., units of measure, dates, attributes like color or size.) Data transformation is a critical function, as it significantly improves data discoverability and usability.
3. Load — saving data to a new destination. After bringing data into a usable state, engineers can load it to the destination that typically is a relational database management system (RDBMS), a data warehouse, or Hadoop. Each destination has its specific practices to follow for performance and reliability.
NB: Despite being automated, a data pipeline must be constantly maintained by data engineers. They repair failures, update the system by adding/deleting fields, or adjust the schema to the changing needs of the business.
Data pipeline challenges
Setting up secure and reliable data flow is a challenging task. There are so many things that can go wrong during data transportation: Data can be corrupted, hit bottlenecks causing latency, or data sources may conflict, generating duplicate or incorrect data. Getting data into one place requires careful planning and testing to filter out junk data, eliminating duplicates and incompatible data types to obfuscate sensitive information while not missing critical data.
Juan De Dios Santos, an experienced practitioner in the data industry, outlines two major pitfalls in building data pipelines:
- lacking relevant metrics and
- underestimating data load.
“The importance of a healthy and relevant metrics system is that it can inform us of the status and performance of each pipeline stage, while with underestimating the data load, I am referring to building the system in such a way that it won’t face any overload if the product experiences an unexpected surge of users,” elaborates Juan.
Having an ETL pipeline that is only connected to a standard transactional database is ineffective as the volume and variety of data requires a dedicated storage design. So, besides an ETL pipeline there’s a need to build a data warehouse that supports and facilitates BI activities. Let’s see how it works.
A data warehouse (DW) is a central repository where data is stored in query-able forms. From a technical standpoint, a data warehouse is a relational database optimized for reading, aggregating, and querying large volumes of data. Traditionally, DWs only contained structured data, or data that can be arranged in tables. However, modern DWs can combine both structured and unstructured data where unstructured refers to a wide variety of forms (such as images, pdf files, audio formats, etc.) that are harder to categorize and process.
Without DWs, data scientists would have to pull data straight from the production database and may wind up reporting different results to the same question or cause delays and even outages. Serving as an enterprise’s single source of truth, the data warehouse simplifies the organization’s reporting and analysis, decision making, and metrics forecasting.
Surprisingly, DW isn’t a regular database. How so?
First of all, they differ in terms of data structure. A regular database normalizes data excluding any data redundancies and separating related data into tables. This takes up a lot of computing resources, as a single query combines data from many tables. Contrarily, a DW uses simple queries with few tables to improve performance and analytics.
Second, aimed at day-to-day transactions, standard transactional databases don’t usually store historic data, while for warehouses, it’s their main purpose, as they collect data from multiple periods. DW simplifies a data analyst’s job, allowing for manipulating all data from a single interface and deriving analytics, visualizations, and statistics.
Typically, a data warehouse doesn’t support as many concurrent users as a database, being designed for a small group of analysts and business users.
Data warehouse architecture
To construct a data warehouse, four basic components are combined.
Data warehouse storage. The foundation of data warehouse architecture is a database that stores all enterprise data allowing business users to access it for drawing valuable insights.
Data architects usually decide between on-premises and cloud-hosted DWs noting how the business can benefit from this or that solution. Although the cloud environment is more cost-efficient, easier to scale up or down, and isn’t limited to a prescribed structure, it may lose to on-prem solutions in terms of querying speed and security. We’re going to list the most popular tools further on.
A data architect can also design collective storage for your data warehouse – multiple databases running in parallel. This will improve the warehouse’s scalability.
Metadata. Adding business context to data, metadata helps transform it into comprehensible knowledge. Metadata defines how data can be changed and processed. It contains information about any transformations or operations applied to source data while loading it into the data warehouse.
Data warehouse access tools. Designed to facilitate interactions with DW databases for business users, access tools need to be integrated with the warehouse. They have different functions. For example, query and reporting tools are used for generating business analysis reports. Another type of access tools – data mining tools – automate the process of finding patterns and correlations in large amounts of data based on advanced statistical modeling techniques.
Data warehouse management tools. Spanning the enterprise, data warehouse deals with a number of management and administrative operations. That’s why managing a DW requires a solution that can facilitate all these operations. Dedicated data warehouse management tools exist to accomplish this.
For a more detailed description of different data warehouse architectures, types, and components, visit our dedicated post.
Data warehouses are a great step forward in enhancing your data architecture. However, if you have hundreds of users from different departments, DWs can be too bulky and slow to operate. In this case, data marts can be built and implemented to increase speed and efficiency.
Simply speaking, a data mart is a smaller data warehouse (their size is usually less than 100Gb.). They become necessary when the company (and the amount of its data) grows and it becomes too long and ineffective to search for information in an enterprise DW. Instead, data marts are built to allow different departments (e.g., sales, marketing, C-suite) to access relevant information quickly and easily.
The place of data marts in the data infrastructure (dependent type)
There are three main types of data marts.
Dependent data marts are created from an enterprise DW and use it as a main source of information (it’s also known as a top-down approach).
Independent data marts are standalone systems that function without DWs extracting information from various external and internal sources (it’s also known as a top-down approach).
Hybrid data marts combine information from both DW and other operational systems.
So, the main difference between data warehouses and data marts is that a DW is a large repository that holds all company data extracted from multiple sources, making it difficult to process and manage queries. Meanwhile, a data mart is a smaller storage that contains a limited amount of data for the usage of a particular business group or department.
Here is a comprehensive overview of the concept of data marts, their types, and structure if you want to learn more.
While data marts allow business users to quickly access the queried data, often just getting the information is not enough. It has to be efficiently processed and analyzed to get those actionable insights that support decision-making. Looking at your data from different perspectives is possible thanks to OLAP cubes. Let’s see what they are.
OLAP and OLAP cubes
OLAP or Online Analytical Processing refers to the computing approach that allows users to perform multidimensional data analysis. It’s contrasted with OLTP or Online Transactional Processing, which is a simpler method of interacting with databases that isn’t designed for analyzing massive amounts of data from different perspectives.
Traditional databases look like spreadsheets, using the two-dimensional structure of rows and columns. However, in OLAP, datasets are presented in multidimensional structures — OLAP cubes. Such structures enable efficient processing and advanced analysis of huge amounts of varied data. For example, a sales department report would include such dimensions as product, region, sales representative, sales amount, month, and so on.
Here’s where OLAP cubes are in the company’s data architecture. Information from DWs is aggregated and loaded into the OLAP cube where it gets precalculated and is readily available for users requests.
OLAP cubes architecture
Within OLAP, data can be analyzed from multiple perspectives. For example, it can be drilled down/rolled up if you need to change the hierarchy level of data representation and get a more/less detailed picture. You can also slice information to segment out a particular dataset as a separate spreadsheet or dice it to create a different cube. Using these and other techniques enables finding patterns in varied data and creating a wide range of reports.
It’s important to note that OLAP cubes have to be custom built for every particular report or analytical query. However, their usage is justified since, as we said, they enable advanced, multidimensional analysis that was previously too complicated to perform.
For a more detailed explanation of OLAP concepts, architecture, and main operations, consider reading our dedicated post.
ELT data pipeline and big data engineering
Speaking about data engineering, we can’t ignore the big data concept. Grounded in the four Vs – volume, velocity, variety, and veracity – big data usually floods large technology companies like YouTube, Amazon, or Instagram. Big data engineering is about building massive reservoirs and highly scalable and fault-tolerant distributed systems able to inherently store and process data.
Big data architecture differs from conventional data handling, as here we’re talking about such massive volumes of rapidly changing information streams that a data warehouse isn’t able to accommodate. The architecture that can handle such an amount of data is a data lake.
A Data lake is a vast pool for saving data in its native, unprocessed form. A data lake stands out for its high agility as it isn’t limited to a warehouse’s fixed configuration.
Big data architecture with a data lake
In contrast to the ETL architecture we described above, a data lake uses the ELT approach swapping transform and load operations. Supporting large storage and scalable computing, a data lake starts data loading immediately after extracting it, handling raw — often unstructured — data.
You can check our detailed comparison of ETL and ETL approaches , but in a nutshell, ELT is a more advanced method as it allows for significantly increasing volumes of data to be processed. It also expedites information processing (since transformation happens only on-demand) and requires less maintenance.
A data lake is worth building in those projects that are going to scale and would need a more advanced architecture. Besides, they are very convenient, for instance, when the purpose of data hasn’t been determined yet since you can load data quickly, store it, and then modify it as necessary. Once you need data, you can apply such data processing tools as Apache or MapReduce to transform it during retrieval and analysis.
Data lakes are also a powerful tool for data scientists and ML engineers, who would use raw data to prepare it for predictive analytics and machine learning. Read more about data preparation in our separate article or watch this 14-minute explainer.
Data preparation 101
Lakes are built on large, distributed clusters that would be able to store and process those massives of data. A popular example of such data lake platforms is Hadoop. Let’s look closer at what that is.
So, Hadoop is a large-scale data processing framework based on Java. This software project is capable of structuring various big data types for further analysis. The platform allows for splitting data analysis jobs across various computers and processing them in parallel.
The Hadoop ecosystem consists of the following set of tools.
Hadoop Distributed File System (HDFS). Java-based big data storage system, HDFS includes two components: NameNode stores metadata while DataNode is responsible for actual data and performs operations according to NameNode.
MapReduce. It’s a framework for writing applications that process the data stored in HDFS. Parallel in nature, MapReduce programs are effective for performing big data analysis using multiple machines in the cluster.
YARN. As the operating system of Hadoop, YARN helps manage and monitor workloads.
Hive. A system for summarizing, querying and analyzing large datasets, Hive uses its own language – HQL- which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs for execution on Hadoop.
Pig. Having similar goals with Hive, Pig also has its own language – PigLatin. When to use Pig and when to use Hive is the question. Pig is a better option for programming purposes, while Hive is mainly used by data analysts for creating reports.
HBase. A NoSQL database built on top of HDFS that provides real-time access to read or write data.
There are many other components that empower Hadoop functionality: HCatalog, Avro, Thrift, Apache Drill, Mahout, Sqoop, Flume, Ambari, Zookeeper, Oozie. You can read about their specifications in Hadoop documentation. You can also check our post dedicated to Hadoop and Spark as the main big data tools for a more detailed description.
Enterprise data hub
Besides big data capabilities, data lakes also brought new challenges for governance and security, and the risk of turning into a data swamp – a collection of all kinds of data that is neither governable nor usable. To tackle these problems, a new data integration approach emerged – data hub – where data is physically moved and re-indexed into a new system.
Enterprise data hubs (EDHs) are the next generation of data management that evolved from DWs and data lakes, enabling easier data consolidation, harmonization, processing, governance, and exploration.
In the data hub architecture, data from many operational and analytic sources is acquired through replication and/or publish-and-subscribe interfaces (read about pub/sub messaging pattern in our explainer.). As data changes occur, replication uses changed data capture (CDC) to continuously populate the hub, while publish-and-subscribe allows the hub to subscribe to messages published by data sources. EDH data-centric storage architecture enables executing applications where the data resides.
So what are the benefits of EDH over traditional data consolidation? Let’s have a closer look.
Easy connection of new data sources. Contrary to DW that lacks flexibility, modern EDH can connect multiple systems on the fly, integrating diverse data types.
Up-to-date data. Outdated data can be an issue with a DW, but EDH overcomes it, presenting fresh data ready for analysis right after capturing it.
Tools. EDH offers powerful tools for processing and analyzing data.
Rapid deployment. While a DW system deploy can last months and even years, EDH deployment is a matter of days or weeks.
To sum it all up, a data warehouse is constructed to deal mainly with structured data for the purpose of self-service analytics and BI; a data lake is built to deal with sizable aggregates of both structured and unstructured data to support deep learning, machine learning, and AI in general; and a data hub is created for multi-structured data portability, easier exchange, and efficient processing.
An EDH can be integrated with a DW and/or a data lake to streamline data processing and deal with the common challenges these architectures face.
Role of data engineer
Now that we know what data engineering is concerned with, let’s delve into the role that specializes in creating software solutions around big data – a data engineer.
Juan De Dios Santos, a data engineer himself, defines this role in the following way: “In a multidisciplinary team that includes data scientists, BI engineers, and data engineers, the role of the data engineer is mostly to ensure the quality and availability of the data.” He also adds that a data engineer might collaborate with the others at the time of implementing or designing a data-related feature (or product) such an A/B test, the deployment of a machine learning model, and the refinement of an existing data source.
We have a separate article explaining the role, responsibilities, and skills of a data engineer, so here we’ll only briefly recap.
Skills and qualifications
Overlapping skills of the software engineer, data engineer, and data scientist, Source: Ryan Swanstrom
Software engineering background. Data engineers use programming languages to enable clean, reliable, and performative access to data and databases. Juan points out their ability to work with the complete cycle of software development including ideation, architecture design, deployment and DevOps, prototyping, testing, defining metrics, alerts, and monitoring systems. Data engineers are experienced programmers in at least Python or Scala/Java.
Data-related skills. “A data engineer should have knowledge of multiple kinds of databases (SQL and NoSQL), data platforms, concepts such as MapReduce, batch and stream processing, and even some basic theory of data itself, e.g., data types, and descriptive statistics,” underlines Juan.
Systems creation skills. Data engineers need to have experience with various data storage technologies and frameworks they can combine to build data pipelines.
Data engineering process involves using different data storage and manipulation tools together. So a data engineer should have a deep understanding of many data technologies to be able to choose the right ones for a certain job.
Tools for writing ETL/ELT pipelines:
Airflow. This Python-based workflow management system was initially developed by Airbnb to rearchitect their data pipelines. Migrating to Airflow, the company reduced their experimentation reporting framework (ERF) run-time from 24+ hours to about 45 minutes. Airflow’s key feature is automating scripts to perform tasks. Among the Airflow’s pros, Juan highlights its operators: “They allow us to execute bash commands, run a SQL query or even send an email”. Juan also stresses Airflow’s ability to send Slack notifications, complete and rich UI, and the overall maturity of the project. On the contrary, Juan dislikes that Airflow only allows for writing jobs in Python.
Cloud Dataflow. A cloud-based data processing service, Dataflow is aimed at large-scale data ingestion and low-latency processing through fast parallel execution of the analytics pipelines. Dataflow has a benefit over Airflow as it supports multiple languages like Java, Python, SQL, and engines like Flink and Spark. It is also well maintained by Google Cloud. However, Juan warns that Dataflow’s high cost might be a disadvantage for some.
Kafka. From a messaging queue to a full-fledged event streaming platform, Apache Kafka distributes data across multiple nodes for a highly available deployment within a single data center or across multiple availability zones. As an abstraction of a distributed commit log, it provides durable storage.
Kafka and data streaming, explained
Warehouse solutions. Widely used on-premise data warehouse tools include Teradata Data Warehouse, SAP Data Warehouse, IBM db2, and Oracle Exadata. Most popular cloud-based data warehouse solutions are Amazon Redshift and Google BigQuery. Be sure to check our detailed comparison of the top cloud warehouse software.
Big data tools. Big data technologies that a data engineer should be able to utilize (or at least know of) are Hadoop, distributed file systems such as HDFS, search engines like Elasticsearch, ETL and data platforms: Apache Spark analytics engine for large-scale data processing, Apache Drill SQL query engine with big data execution capabilities, Apache Beam model and software development kit for constructing and running pipelines on distributed processing backends in parallel.
Closing: Data engineer vs data scientist
It’s not rare that a data engineer is confused with a data scientist. We asked Alexander Konduforov, a data scientist at AltexSoft, with over 10 years of experience, to comment on the difference between these two roles:
“Both data scientists and data engineers work with data but solve quite different tasks, have different skills, and use different tools. Data engineers build and maintain massive data storage and apply engineering skills: programming languages, ETL techniques, knowledge of different data warehouses and database languages. Whereas data scientists clean and analyze this data, get valuable insights from it, implement models for forecasting and predictive analytics, and mostly apply their math and algorithmic skills, machine learning algorithms and tools.”
Alexander stresses that accessing data can be a difficult task for data scientists for several reasons.
- Vast data volumes require additional effort and specific engineering solutions to access and process it in a reasonable amount of time.
- Data is usually stored in lots of different storages and formats. In this case, it makes sense first to clean it up by taking dataset preparation measures, transform, merge, and move to a more structured storage, like a data warehouse. This is typically a task for data architects and engineers.
- Data storages have different APIs for accessing them. In this case, data scientists need data engineers to implement the most efficient and reliable pipeline of getting data for their purpose.
As we can see, working with data storages built by data engineers, data scientists become their “internal clients.” That’s where their collaboration takes place.