Data lakes are repositories used to store massive amounts of data, typically for future analysis, big data processing, and machine learning. A data lake can enable you to do more with your data. However, if you are not careful when designing, implementing, and maintaining your data lake, it can quickly turn into a useless data swamp.
In this article, you will learn what a data lake is, what Extract, Load, Transform (ELT) is, and how these two concepts are connected. You will also learn the essential building blocks of a data lake architecture and what cloud-based data lake options are available on AWS, Azure, and GCP.
What Is a Data Lake?
A data lake is a central repository used to store both structured and unstructured data for use in analytics, big data processing, and machine learning. It differs from a data warehouse, which stores only structured, processed data. A newer architecture, known as a data lakehouse, combines features of both a data warehouse and a data lake.
Essential characteristics of a data lake include:
- Data movement—you can import large amounts of data in real time. This data can come from multiple sources and doesn’t require any processing or transformation before storage.
- Availability—the lake is accessible from a variety of platforms and can be accessed by multiple users at once. Users can locate specific data in a lake through crawling, cataloging and indexing.
- Flexibility—analytics and machine learning platforms can access data directly and data can be used in place rather than requiring export.
Why Do You Need a Data Lake?
Data lakes enable you to gain maximum benefit from your data by expanding the types of data you can add to your data pipeline and the ways you can use it. Because a lake centralizes data, all operations are performed against the same data, which improves reliability. And when data is more easily and readily available, insights can be derived and applied faster.
What Is Extract, Load, Transform (ELT)?
Extract, load, transform (ELT) is a multi-step process used to ingest raw data and prepare it for downstream use. In the first step, data sources are identified and data is extracted into an ELT pipeline. Next, the data is loaded, as-is, into the data lake or storage resource. Finally, the data is transformed into whatever format is needed for analysis.
Transformation may include:
- Anonymizing data
- Aggregating calculations
- Converting data types or schemas
- Combining data from multiple sources
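The load-then-transform flow can be sketched in a few lines of Python. The records, field names, and anonymization rule below are illustrative assumptions rather than any particular pipeline:

```python
import hashlib
import json

# Steps 1-2 (extract + load): raw events are landed as-is, e.g. as JSON records.
raw_records = [
    {"user_email": "alice@example.com", "amount": "19.99", "region": "EU"},
    {"user_email": "bob@example.com", "amount": "5.00", "region": "EU"},
]

def transform(record):
    """Step 3: anonymize and convert types in one raw record."""
    return {
        # Anonymize: replace the email with a short one-way hash.
        "user_id": hashlib.sha256(record["user_email"].encode()).hexdigest()[:12],
        # Convert data types: amounts arrive as strings, analysis wants floats.
        "amount": float(record["amount"]),
        "region": record["region"],
    }

transformed = [transform(r) for r in raw_records]

# Aggregate: total spend per region, combined across all source records.
totals = {}
for rec in transformed:
    totals[rec["region"]] = totals.get(rec["region"], 0.0) + rec["amount"]

print(json.dumps(totals, sort_keys=True))
```

Because the raw records are loaded untouched first, the same landed data can later be transformed again in a different way without re-extracting it from the source.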
Data Lake Architecture
When creating a data lake, you can build it as a single repository. However, it is more common to compose lakes from multiple layers or services. This keeps data better organized and can make it more efficient to use.
Commonly used layers include:
- Ingestion layer — this is the base layer that holds raw data as it is loaded into the lake. From this layer, data can be accessed, organized, or processed as needed.
- Standardized layer — this layer stores data that has been transformed, or that is already in a format likely to be needed. It is useful for commonly required data formats and helps you avoid performing the same transformation multiple times.
- Curated layer — this layer contains cleansed data that has been transformed into consumable data sets. Typically, this layer also includes data that has been consolidated from multiple data sources. The curated layer is the layer most often accessed by users.
- Production layer — also called the application or trusted layer. This layer contains data with applied business logic. For example, row-level security or surrogate keys. It contains data that is ready to be consumed by applications.
- Sandbox layer — containing in-process or temporary data transformations, the sandbox layer enables analysts and data scientists to experiment with data without impacting other users.
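One common way to realize these layers in object storage is as key prefixes within a single bucket. The prefix names and the `promote` helper below are an illustrative sketch, not a standard; real deployments often use separate buckets or containers per layer:

```python
# Illustrative layer prefixes matching the five layers described above.
LAYERS = ["ingestion", "standardized", "curated", "production", "sandbox"]

def layer_key(layer, dataset, filename):
    """Build an object key that places a file in one of the lake's layers."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer}/{dataset}/{filename}"

def promote(key, to_layer):
    """Re-key an object into another layer (the data itself would be copied)."""
    if to_layer not in LAYERS:
        raise ValueError(f"unknown layer: {to_layer}")
    _, rest = key.split("/", 1)
    return f"{to_layer}/{rest}"

raw = layer_key("ingestion", "sales", "2024-01-01.json")
print(raw)
print(promote(raw, "curated"))
```

Keeping the layer name first in the key also makes it easy to apply different access controls or retention rules per layer.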
Additional Components
While layers make up the bulk of a data lake, other components are also required. These include:
- Security — data lakes hold massive amounts of valuable data and are designed to be highly accessible. To ensure that your data is accessible only to the users and applications you intend, and only in the ways you intend, you must implement security tooling. This should include permissions management, authentication, and encryption.
- Governance—you need tooling that enables you to monitor data lake performance, capacity, and use. All operations need to be logged to ensure that events are traceable.
- Orchestration—orchestration tools enable you to manage ELT processes, control job scheduling, and ensure that data remains accessible to users and applications. This is particularly important if you are using a distributed data lake composed of multiple resources or cloud services.
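At its simplest, the scheduling side of orchestration means running jobs in dependency order. Production systems use dedicated tools such as Apache Airflow for this; the tiny resolver below, with made-up job names, is only a sketch of the underlying idea:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each hypothetical ELT job maps to the set of jobs it depends on.
jobs = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "standardize_orders": {"ingest_orders"},
    "curate_sales": {"standardize_orders", "ingest_customers"},
}

# A valid run order: every job appears after all of its dependencies.
order = list(TopologicalSorter(jobs).static_order())
print(order)
```

Real orchestrators add retries, scheduling, and monitoring on top of this core dependency resolution.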
Cloud Data Lake Architectures: The Big Three
It should be no surprise that many data lakes rely on cloud services. Much of the data organizations use comes from cloud services, and many organizations are already running workloads in the cloud. To meet customer demand, all three major cloud providers offer a variety of services you can use to create a data lake in the cloud.
Data Lake on AWS
Although AWS does not offer a specific data lake service, it does provide an automated reference implementation that you can use to provision and deploy a lake from a combination of services. Alternatively, you can construct a data lake customized to your needs by selecting and integrating services yourself, although this requires more expertise.
For many organizations, the offered solution, called Data Lake on AWS, is the best starting option. It includes a console and CLI that you can use to search for and manage data along with the following services — Cognito, Lambda, API Gateway, S3, DynamoDB, Elasticsearch Service, Glue, and CloudWatch.
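As a sketch of the ingestion step on AWS, the snippet below lands a local file, unchanged, in a raw-zone prefix of an S3 bucket. The bucket name, key layout, and `upload_raw` helper are assumptions for illustration; running the upload itself requires the boto3 library and AWS credentials:

```python
from datetime import date

def raw_key(source, filename, day=None):
    """Build a date-partitioned key for the raw (ingestion) zone."""
    day = day or date.today()
    return f"raw/{source}/dt={day.isoformat()}/{filename}"

def upload_raw(bucket, source, path):
    """Load a local file as-is into S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported here so raw_key above works without the SDK installed
    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, raw_key(source, path.rsplit("/", 1)[-1]))

# Example: upload_raw("my-data-lake", "crm", "/tmp/contacts.csv") would land the
# file at raw/crm/dt=<today>/contacts.csv in the hypothetical my-data-lake bucket.
```

Date partitioning the raw zone up front makes later Glue crawls and queries cheaper, since they can scan only the partitions they need.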
Azure Data Lake
Azure Data Lake is an offering that combines three optimized services, from which you can construct a data lake:
- HDInsight — a service that enables you to run open-source analytics frameworks, including Hadoop, Spark, and Kafka.
- Data Lake Analytics — an analytics job service that enables you to develop and run processing and transformation programs. It supports programs written in R, Python, .NET, and U-SQL.
- Azure Data Lake Storage — a storage service built on Azure Blob Storage and optimized for analytics workloads.
Google Data Lakes
Google does not offer a specific data lake service but does provide a variety of services that can be integrated to create a lake. While creating a lake in GCP is not as user-friendly as on the other clouds, it can offer greater flexibility.
To build your lake in GCP, you use Google Cloud Storage as the base. You can then apply a variety of services, including Dataflow, Cloud Pub/Sub, and Storage Transfer Service, to ingest data. For analytics, you can use Cloud Dataproc (which includes a managed Hive service), BigQuery, and Cloud Datalab (which provides managed Jupyter notebooks).
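The Cloud Storage base can be addressed with the same layered layout described earlier. The bucket name and helper below are illustrative; the upload call requires the google-cloud-storage client library and credentials, while the URI helper is plain Python:

```python
def gcs_uri(bucket, layer, dataset, filename):
    """Build the gs:// URI where an object lands in the lake (illustrative layout)."""
    return f"gs://{bucket}/{layer}/{dataset}/{filename}"

def upload_to_lake(bucket, layer, dataset, path):
    """Upload a local file into the lake (needs google-cloud-storage + credentials)."""
    from google.cloud import storage  # imported here so gcs_uri works without the SDK
    blob_name = f"{layer}/{dataset}/{path.rsplit('/', 1)[-1]}"
    storage.Client().bucket(bucket).blob(blob_name).upload_from_filename(path)

# BigQuery can then query such objects in place, e.g. via an external table
# whose source URI pattern might look like:
print(gcs_uri("my-lake", "curated", "sales", "*.parquet"))
```

Querying the objects in place, rather than importing them, is what lets GCS act as the lake while BigQuery and Dataproc serve as its analytics layers.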
Successful Data Lake in Action
As an example of the difference data lakes can make, you can look at a lake implementation created by EMC, a developer of storage and analytics technologies.
EMC had the necessary data but was unable to gather it in a way that allowed holistic and strategic use. Due to company growth and the incorporation of other businesses, much of their data was siloed and data management was inconsistent. Additionally, when business units needed to report on data, they had to turn to IT, creating bottlenecks and reducing efficiency.
To eliminate these issues, the company designed an in-house lake from their Greenplum Data Computing Appliance and Isilon Scale-Out NAS. These systems were based on Intel hardware and supported by custom algorithms. The company also incorporated the use of sandboxes that various business units could use for self-service analytics and data manipulation.
Once data was joined in a lake, analysts were able to access all data uniformly, and IT was better able to maintain data fidelity. As a result, EMC was able to reduce data query times from hours to under a minute and boost the accuracy of predictive models.
Conclusion
There is a wide range of architectures you can use when building your data lake, but most should contain five layers for ingestion, standardization, curation, production, and sandboxing. Additional highly recommended components include security, governance, and orchestration tooling. You can set up your data lake in the cloud or on-premises.
Each cloud vendor offers different data lake components and features. Data lakes on AWS come with automation capabilities. Azure data lakes are typically composed of three services — HDInsight, Data Lake Analytics, and Azure Data Lake Storage. When using GCP, you will need to manually architect your data lake using available GCP resources.
Remember that a properly configured data lake should enable you to import big data in real time from multiple sources, while providing high availability and flexibility. Whether you are setting up your data lake in the cloud or on-prem, if you want to avoid the dreaded data swamp, be sure that your data lake provides the three basic data lake characteristics.
Farhan Munir - With over 12 years of experience in the technical domain, I have witnessed the evolution of many web technologies, as well as the rise of the digital economy. I consider myself a life-long learner, and I love experimenting with new technologies. I embrace challenges with enthusiasm and an outside-of-the-box mindset. I feel it is important to share your experiences with the rest of the world - in order to pass on the knowledge, or to let other folks learn from your mistakes or successes. In my spare time, I like to travel and photograph the world.