
Data Lakehouse – Powerhouse of the future

Data Engineering


16th March 2023

Thulasiram Gunipati


Databricks recently published a blog post on the Data lakehouse. As soon as it was published, the word "lakehouse" spread like wildfire across the internet. One reason for the buzz may be the authors of the post: stalwarts like Ben Lorica, Michael Armbrust, Ali Ghodsi, Reynold Xin and Matei Zaharia wrote it. Data lakehouse is not a new term; it has been in use for the past two years, and Snowflake and AWS have both used "lakehouse" before.

 

This article aims to answer the following questions -

  1. What is a Data lakehouse? 
  2. What are some of the challenges with the present setup? 
  3. What challenges will a Data lakehouse resolve? 
  4. Is the technology available to us for creating a data lakehouse?
  5. What are the tools which are available to us today to provide some advantages of a lakehouse?

 

Data lakehouse

A Data lakehouse is a combination of a data lake and a data warehouse. A data lake can store huge amounts of data and provides excellent scalability. A data warehouse is an online analytical processing (OLAP) system that combines different sources of data and provides a single source of truth for the organisation; it is designed to make analysis on large datasets easy. A Data lakehouse combines the capabilities of both. Why do we need to combine these capabilities? To understand why a lakehouse is required, we should discuss the present challenges with our data lakes and warehouses.

Data lakes and Data warehouses

Data lakes are great. They enable us to collect data from all our different sources, provide cheap storage, and take care of availability and durability. Data from the sources is compressed and sent to the data lake, arriving in a variety of formats like JSON, CSV, XML, etc. The challenge arises when a data analyst wants to derive insights from the data in a data lake. The data will be in a format that is not conducive to querying and analysis. There are ways to query the data directly from the data lake, but they are not optimised. In AWS, we can use Glue to create a database and tables directly on the raw data in the data lake. This will do the job but will not provide the required latency on large datasets. It is also difficult to enforce data governance on raw data. Data lakes do not provide ACID transactions, and they do not directly support integration with Business Intelligence (BI) tools; where such support exists, the functionality is very limited. To sum it up, it is not easy to derive insights directly from a data lake.
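
As an illustration of the Glue approach, here is a minimal sketch of pointing a crawler at raw data in S3 to build a queryable catalogue. The bucket, database and IAM role names are hypothetical, not taken from any real setup.

```python
import boto3

# A minimal sketch of cataloguing raw data with AWS Glue.
# The bucket, database and IAM role names below are hypothetical.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="raw-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="raw_lake_db",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/events/"}]},
)

# Running the crawler infers the schema of the raw JSON/CSV files and registers
# tables in the Glue Data Catalog, which Athena can then query directly.
glue.start_crawler(Name="raw-events-crawler")
```

Even with a catalogue like this in place, large scans remain slow, governance is limited and there are no ACID guarantees. Now enters the data warehouse.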

 

Data warehouses are an older technology than data lakes, and the technology behind them is well evolved and mature. As the data available in the data lake is not suitable for direct use, it is ingested into a data warehouse to derive value from it. Data from the raw sources is cleaned, joined and transformed into a shape suitable for the warehouse. Creating the pipelines for extracting, transforming and loading (ETL) the data takes time and energy, and maintaining these ETL or Extract, Load and Transform (ELT) pipelines is a necessary part of ingesting data into the warehouse. Once the data is available in the warehouse, it is easy to query and use for analysis. Data warehouses also provide connectors for integration with Business Intelligence (BI) tools. Query latency is very low, and charts in BI tools are populated quickly when backed by a data warehouse. The difficulty is in creating and maintaining the ELT or ETL pipelines. There is also data redundancy when using the warehouse, because copies of the same data are stored in multiple formats in multiple places, for example a raw zone, a conformed zone and a modelled zone. Different organisations use different terms, but you get the idea.
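
To make the cost of such pipelines concrete, here is a rough sketch of one ETL step in PySpark: read raw JSON from the lake, clean it, and load it into a warehouse table over JDBC. All paths, column names and connection details are hypothetical, and a real pipeline would repeat this for every source.

```python
from pyspark.sql import SparkSession, functions as F

# A rough sketch of one ETL step: clean raw lake data and load it into a warehouse.
# Paths, column names and the JDBC connection details are all hypothetical.
spark = SparkSession.builder.appName("orders-etl").getOrCreate()

raw_orders = spark.read.json("s3://example-data-lake/raw/orders/")

cleaned = (
    raw_orders
    .dropDuplicates(["order_id"])
    .filter(F.col("order_id").isNotNull())
    .withColumn("order_date", F.to_date("order_ts"))
)

# Load the conformed data into the warehouse; this is the step that has to be
# built and maintained for every source feeding the warehouse.
(cleaned.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://warehouse.example.com:5439/analytics")
    .option("dbtable", "conformed.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("append")
    .save())
```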

 

To overcome the above challenges, there is a need for a Data lakehouse. Read this blog to learn about the Data lakehouse in detail. 

 

Advantages of Data Lakehouse

  1. Support for various data types - structured, semi-structured and unstructured
  2. Support for ACID transactions
  3. The data in the lakehouse is directly available for various types of workloads like data science, machine learning, analytics, BI, etc.
  4. The latency of the results returned is low or within acceptable limits
  5. Data governance and auditing
  6. Direct BI support
  7. The data is stored in one place without any redundancy
  8. Value can be derived from the data in the shortest time (time to market)
  9. No need to create and maintain costly ETL or ELT pipelines
  10. Support for end-to-end streaming

 

Technology for the Lakehouse

Even though the term 'lakehouse' was in use before, the authors of the Databricks blog post gave it a new meaning and vision. According to that vision, the technology does not yet exist as a whole; some of the pieces exist (discussed below), but not the complete package. The Data lakehouse is futuristic. We are taking baby steps in that direction, and one day we will reach the destination.

 

Tools which exist as partial solutions

  1. AWS Athena
  2. Snowflake
  3. AWS Redshift Spectrum
  4. Azure Synapse

 

As discussed above, we can use AWS Glue to crawl the data and create a catalogue of the raw data. Once the catalogue and the associated database and tables are created, the data becomes available in Athena for querying. If the amount of data is huge, some latency is to be expected. Complete ETL pipelines can also be set up using AWS Glue. Glue leverages Spark underneath and is serverless.
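
Once the tables exist in the catalogue, a query over the raw data can be submitted to Athena programmatically. This is a minimal sketch; the database, table and output location are placeholders.

```python
import boto3

# Submit a query over the catalogued raw data to Athena.
# Database, table, partition value and output location are placeholders.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="""
        SELECT device_id, COUNT(*) AS events
        FROM raw_events
        WHERE dt = '2023-01-01'
        GROUP BY device_id
    """,
    QueryExecutionContext={"Database": "raw_lake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])

# Athena scans the raw files directly, so large or unpartitioned datasets
# will show exactly the latency problems described above.
```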

iRobot collects close to 3 TB of JSON data from IoT devices. It catalogues the raw data using AWS Glue, and this data becomes directly available via AWS Athena. iRobot uses Athena to query a few hours' or a day's worth of data. They use Glue to convert the raw JSON data to Parquet, which is again catalogued and used for analysing historical data. AWS Glue is serverless, with the infrastructure managed by AWS. Using services like Glue, ETL pipelines can be created very quickly, but this does not eliminate the requirement for ETL pipelines as envisioned by the Data lakehouse.
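
A Glue job that performs this kind of JSON-to-Parquet conversion might look roughly like the sketch below. It only runs inside a Glue job environment, and the database, table and output path are assumptions, not iRobot's actual setup.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Rough sketch of a Glue job converting catalogued raw JSON to Parquet.
# The database, table and output path are assumptions, not iRobot's actual setup.
glue_context = GlueContext(SparkContext.getOrCreate())

raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_lake_db",
    table_name="raw_events",
)

# Write the same data back to S3 as Parquet; a second crawler (or the job itself)
# can then register this curated copy for analysing historical data.
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://example-data-lake/curated/events/"},
    format="parquet",
)
```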

 

Watch the webinar recorded above to understand the rift between a data lake and a data warehouse, and how Snowflake is trying to bridge it. Again, I have not used Snowflake myself; this is based on what I have read and the videos I have watched.

 

Redshift Spectrum leverages Redshift clusters to query raw data sitting directly in S3. Asim Kumar Sasmal and Maor Kleider wrote an excellent blog post about how to use Spectrum to query data in S3. AWS has now added the ability to export the results of an analysis from Spectrum, partition the data on the fly and store it back in S3.
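
In practice, the Spectrum workflow boils down to a couple of SQL statements, sketched here as a Python script using psycopg2. The cluster endpoint, schema, table, IAM role and bucket names are all placeholders.

```python
import psycopg2

# Rough sketch of the Redshift Spectrum workflow: expose the Glue catalogue as an
# external schema, query raw S3 data through the cluster, then unload partitioned
# results back to S3. Endpoint, names, IAM role and bucket are all placeholders.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="***",
)
conn.autocommit = True
cur = conn.cursor()

# Make the Glue catalogue visible to the cluster as an external schema.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'raw_lake_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
""")

# Export an aggregation over the raw data, partitioned on the fly, back to S3.
cur.execute("""
    UNLOAD ('SELECT device_id, dt, COUNT(*) AS events
             FROM spectrum.raw_events GROUP BY device_id, dt')
    TO 's3://example-data-lake/exports/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
    FORMAT PARQUET
    PARTITION BY (dt);
""")
```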

 

 

The Advancing Analytics blog post suggests that Azure Synapse comes very close to the capabilities expected of a Data lakehouse. Synapse is not yet available in public preview, so it is very difficult to say at the moment how close it actually comes to the concept of a data lakehouse. Check out their demo video on this webpage. Azure Synapse deserves a closer look and a blog post of its own in the future!



There are other tools we can use, for example Apache Drill, Apache Iceberg, etc., to bridge the gap between the lake and the warehouse. If you have used any such tool and are satisfied with it, please let me know. You can write to us at mitra@setuschool.com.

 

To summarise, a Data lakehouse combines the capabilities of a data lake and a data warehouse. We can query the data directly in a lakehouse without the need for ETL or ELT pipelines, and the results of queries are returned quickly, without high latencies. At present, the Data lakehouse is a vision, and a hope that we will ultimately get there.
