DLD #1 | Data Landscape Digest 🗞️
Curated Knowledge on the Data Engineering Landscape
Introduction
Welcome to the Data Landscape Digest (DLD) newsletter series. This will be a periodic roundup of the latest and greatest in the world of data, and data engineering in particular. We'll deliver a curated selection of the finest news articles, blog posts, tutorials, discussions, and more from across the data landscape.
✨ Featured
In a recent blog post, I explored the difference between the generic data lakehouses offered by some vendors and open data lakehouses built on a foundation of open source tools and technologies. The Apache Hudi blog published a post providing an overview of the open data lakehouse architecture. It details the architecture layers, components, key technologies, advantages over previous data lake architectures, and use cases for implementing an open data lakehouse. —> Read more
📡 Open Source News
👉 Debezium Latest Official Release
Debezium 2.7.0.Final has been released. Some 140 outstanding issues have been fixed, along with many new features and improvements in the core component as well as the stand-alone connectors. —> Read more
👉 Submission of XTable (formerly OneTable) to the Apache Foundation
OneHouse's announcement that XTable (formerly known as OneTable) has been submitted to the Apache Software Foundation Incubator was a major piece of recent news. Top cloud vendors such as Microsoft and Google are already integrating XTable into their analytics platforms, Microsoft Fabric and BigLake respectively, to provide a unified logical lakehouse with interoperability between different open table formats. —> Read more
🛠 Practical Data Engineering
👉 Spark Repartition() Function
Spark pipelines often employ the df.repartition() function to optimise data processing, especially by consolidating small partitions before loading data into target storage. It's essential to remember that repartitioning in Spark controls parallelism and physical data distribution within the distributed compute engine; it is not a bucketing or SQL-style GROUP BY operation. A blog post explains how Spark repartitioning works and what it actually does. —> Read more
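Below is a minimal PySpark sketch of the pattern described above; the bucket paths, column name, and partition count are illustrative assumptions, not taken from the linked post.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()

# Hypothetical source: many small files producing many small partitions.
df = spark.read.parquet("s3://example-bucket/raw/events/")

# repartition(n) triggers a full shuffle and redistributes rows into n partitions,
# consolidating small partitions before the write to target storage.
df.repartition(8).write.mode("overwrite").parquet("s3://example-bucket/curated/events/")

# repartition(col) hash-distributes rows by a column; this is still a physical
# data-distribution step, not a SQL-style GROUP BY aggregation.
(df.repartition("event_date")
   .write.partitionBy("event_date")
   .mode("overwrite")
   .parquet("s3://example-bucket/curated/events_by_date/"))
```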
👉 Apache XTable + Airflow
AWS published a blog post demonstrating how to use Apache XTable to convert open table format metadata to other formats. The blog post features a custom Airflow operator, XtableOperator(), designed for batch pipeline translations on the AWS platform. The operator's code is available on GitHub. This development suggests that unified open table format adoption is gaining some momentum. —> Read more
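For context, here is a hypothetical sketch of how such an operator might be wired into an Airflow DAG; the import path and every parameter name below are assumptions for illustration and may differ from the actual XtableOperator published with the post.

```python
from datetime import datetime

from airflow import DAG
# Assumed import path for the custom operator from the AWS blog post.
from xtable_provider.operators.xtable import XtableOperator

with DAG(
    dag_id="hudi_to_iceberg_metadata_sync",
    start_date=datetime(2024, 7, 1),
    schedule="@daily",  # Airflow 2.4+ style schedule argument
    catchup=False,
):
    # Translate the table's Hudi metadata into Iceberg metadata in place,
    # without rewriting the underlying Parquet data files.
    convert_metadata = XtableOperator(
        task_id="convert_hudi_to_iceberg",
        source_format="HUDI",                                   # assumed parameter
        target_formats=["ICEBERG"],                             # assumed parameter
        dataset_path="s3://example-bucket/lakehouse/orders/",   # assumed parameter
    )
```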
⚙️ Technical Deep Dive
👉 Apache Paimon’s Internal Design
While the 'big three' open table formats – Apache Hudi, Iceberg, and Delta Lake – dominate the market and discussions, Apache Paimon, a more recent “Flink table format”, has received less attention. If you're interested in learning more about Apache Paimon, a comprehensive blog post by Giannis delves into its design goals, internals, and key features. —> Read more
👉 How dlt Works Under the Hood
Discussions have been ongoing regarding the potential use of new open source data ingestion tools like Data Load Tool (dlt) as a replacement for more established ones like Airbyte in certain use cases (e.g. API data integration). A dlt project contributor has published a blog post detailing the internal data pipeline design and core functions of dlt, including data extraction, normalisation, and loading. The latest version leverages Apache Arrow's efficient in-memory data structure to optimise the entire pipeline. —> Read more
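As a quick illustration of those three stages, here is a minimal dlt pipeline sketch; it assumes the duckdb destination extra is installed (pip install "dlt[duckdb]") and uses an arbitrary public API purely as example data.

```python
import dlt
import requests


@dlt.resource(name="pokemon", write_disposition="replace")
def pokemon_list():
    # Extraction: a resource simply yields Python dicts (or Arrow tables).
    response = requests.get("https://pokeapi.co/api/v2/pokemon", params={"limit": 50})
    response.raise_for_status()
    yield response.json()["results"]


# Normalisation and loading happen inside run(): dlt infers a schema,
# flattens nested structures, and loads the result into the destination.
pipeline = dlt.pipeline(
    pipeline_name="pokeapi_demo",
    destination="duckdb",
    dataset_name="pokemon_data",
)
load_info = pipeline.run(pokemon_list())
print(load_info)
```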
👉 Yet Another Kafka Explanation
There's a wealth of resources available explaining Kafka's architecture and internals. I found a recent blog post series on the topic, which provides clear and concise explanations of the concepts, accompanied by helpful visuals for those unfamiliar with Kafka's design and operation.
Kafka Architecture Overview | Design elements | Kafka Producer | Kafka Consumer
👉 DuckDB's Internal Memory and Buffer Management
If you've used DuckDB or are exploring its capabilities, you might wonder how it handles large datasets without memory limitations, a common issue with some Python dataframes like Pandas. DuckDB’s official blog has recently covered the engine's internal memory and buffer management. It explains how DuckDB leverages streaming execution to process queries without fully loading CSV or Parquet files into memory, and utilises disk spilling when intermediate results exceed memory capacity. —> Read more
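The sketch below shows the two settings most relevant to that behaviour, assuming the Python client and an arbitrary set of local Parquet files.

```python
import duckdb

con = duckdb.connect("analytics.duckdb")
# Cap DuckDB's buffer manager; intermediate results beyond this limit
# spill to the temp directory on disk instead of failing the query.
con.execute("SET memory_limit = '2GB';")
con.execute("SET temp_directory = '/tmp/duckdb_spill';")

# The Parquet files are scanned in a streaming fashion rather than loaded
# wholesale into memory, so the dataset can be much larger than RAM.
top_customers = con.execute(
    """
    SELECT customer_id, sum(amount) AS total_spent
    FROM read_parquet('orders/*.parquet')
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
    """
).fetchall()
print(top_customers)
```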
👉 Snowflake's Micro Partitioning Internal Design
If you've been using Snowflake at your company, you're likely familiar with its internal partitioning feature called micro-partitioning. This process automatically divides tables into micro-partitions of between 50 MB and 500 MB of uncompressed data, organising the data in a columnar format within each micro-partition. A concise and excellent blog post provides a clear explanation of micro-partitioning's internal design, complete with helpful visuals. —> Read more
👉 Overview of Kafka's Tiered Storage Design
Kafka 3.6, released in 2023, introduced a highly anticipated feature: Tiered Storage. This feature currently supports local and remote storage tiers, enabling the movement of inactive segments to a configurable deep storage solution like HDFS or S3 based on local retention settings. This provides a cost-effective and scalable way to retain historical data. Uber is credited with driving the tiered storage proposal [KIP-405], and the linked post walks through the internals of the tiered storage architecture. —> Read more
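For a sense of what the configuration looks like, here is a rough sketch of the relevant settings, assuming Kafka 3.6+ and a separately configured RemoteStorageManager plugin for the remote tier (e.g. an S3- or HDFS-backed implementation); the retention values are illustrative only.

```properties
# Broker-level setting (server.properties): enable the tiered storage subsystem.
remote.log.storage.system.enable=true

# Topic-level settings: opt the topic into tiered storage.
remote.storage.enable=true
# Keep roughly one day of data on local broker disks; older, inactive segments
# are moved to the remote tier and served from there.
local.retention.ms=86400000
# Retain thirty days in total across the local and remote tiers.
retention.ms=2592000000
```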
🔎 Case Studies
👉 Implementation of dlt in Production
It's inspiring to hear about individuals and teams embracing new open-source tools and technologies in production environments. Data Load Tool (dlt), the relatively new ETL library covered earlier in this issue, has been adopted by some teams for production workloads. Alexander from Dataops explores the advantages and disadvantages of using dlt compared to more established data integration tools like Airbyte. —> Read more
👉 Notion's Data Architecture Evolution
Notion has unveiled their new data lakehouse architecture. They chose Apache Hudi as their table format due to its efficient incremental data ingestion capabilities, making it suitable for their update-heavy workloads. The architecture also incorporates event-based CDC ingestion using Debezium and Kafka. —> Read more
💬 Community Discussions
👉 This career advice from a senior data engineer highlights a key point in one of Reddit's highest-rated data engineering discussions in July:
Master the data engineering fundamentals first!
While flashy tools and platforms come and go, a strong foundation in low-level skills like Bash, Git, SQL, pure Python development, and containerisation will take you much further. Juniors who prioritise these foundational skills before diving into advanced tools and stacks will be better positioned for success.
🎥 Conferences & Events
👉 The annual virtual PrestoCon Day 2024, organised by the Linux Foundation/Presto Foundation, took place in June, with sessions covering topics such as the Presto 2.0 native C++ engine and Presto usage at companies like Uber. A recap of the event and the main sessions is provided in this article. All 24 recorded sessions can be found on YouTube.