<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Practical Data Engineering]]></title><description><![CDATA[A Newsletter To Level Up Your Data Engineering Skills brought to you by Zyelabs - Data engineering Consulting]]></description><link>https://www.pracdata.io</link><image><url>https://substackcdn.com/image/fetch/$s_!SGaR!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46497e14-ec41-42a0-9067-72715fc9c842_848x848.png</url><title>Practical Data Engineering</title><link>https://www.pracdata.io</link></image><generator>Substack</generator><lastBuildDate>Wed, 29 Apr 2026 17:58:22 GMT</lastBuildDate><atom:link href="https://www.pracdata.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Alireza Sadeghi]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[practicaldataengineering@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[practicaldataengineering@substack.com]]></itunes:email><itunes:name><![CDATA[Alireza Sadeghi]]></itunes:name></itunes:owner><itunes:author><![CDATA[Alireza Sadeghi]]></itunes:author><googleplay:owner><![CDATA[practicaldataengineering@substack.com]]></googleplay:owner><googleplay:email><![CDATA[practicaldataengineering@substack.com]]></googleplay:email><googleplay:author><![CDATA[Alireza Sadeghi]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Is DuckLake a Step Backward?]]></title><description><![CDATA[Examining the new open table format&#8217;s return to relational metadata management]]></description><link>https://www.pracdata.io/p/is-ducklake-a-step-backward</link><guid 
isPermaLink="false">https://www.pracdata.io/p/is-ducklake-a-step-backward</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 30 Nov 2025 07:30:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WD8O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WD8O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WD8O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png 424w, https://substackcdn.com/image/fetch/$s_!WD8O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png 848w, https://substackcdn.com/image/fetch/$s_!WD8O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png 1272w, https://substackcdn.com/image/fetch/$s_!WD8O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WD8O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png" width="1456" height="983" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:983,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2727013,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/180096864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WD8O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png 424w, https://substackcdn.com/image/fetch/$s_!WD8O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png 848w, https://substackcdn.com/image/fetch/$s_!WD8O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png 1272w, https://substackcdn.com/image/fetch/$s_!WD8O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db21fa2-93bd-436e-9f1a-8cdd464b5d5a_1922x1298.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>When I first read Kafka creator <strong>Jay Kreps</strong>&#8217; article about <strong><a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying">logs</a></strong> (and his subsequent book, <strong>I &#9829; Logs</strong>), I was fascinated by the elegance of immutable, append-only sequences of records. </p><p>Logs represent one of the simplest yet most powerful abstractions for managing data on physical storage. 
Since the invention of Apache Kafka and similar log-structured systems, the big data community has embraced the simplicity and scalability of log-oriented semantics. </p><p>This paradigm shift is most visible in modern data lakehouse table formats like <strong>Apache Iceberg</strong>, <strong>Apache Hudi</strong> and <strong>Delta Lake</strong> where, unlike traditional database implementations, all metadata operations are captured in <strong>immutable logs</strong> stored alongside data files on cloud storage such as S3.</p><p>But then comes <strong>DuckLake</strong>, a new open table format introduced by the DuckDB creators, who have boldly rejected the log-oriented metadata philosophy. </p><p>These contrarians seem determined to challenge the big data and distributed systems conventions built over the last decade. First, they built <a href="https://www.pracdata.io/p/duckdb-beyond-the-hype">DuckDB</a> to argue you don&#8217;t need distributed systems like Spark for the majority of workloads. Now they&#8217;ve built DuckLake to claim you don&#8217;t need distributed metadata logs and Iceberg!</p><p>This article examines DuckLake&#8217;s metadata design in comparison to current- and previous-generation table formats. </p><p>To set up the architectural comparison, let&#8217;s first look at the primary design goals of modern log-oriented table formats.</p><h1>Log-Oriented Open Table Format Design</h1><p>Modern open table formats store metadata in immutable files right alongside the data for each dataset in object storage. Why is that advantageous?</p><p>This enables independent, horizontal scaling of both data and metadata. 
There&#8217;s no need to manage external metadata servers, no backend <strong>Metastore</strong> to become a choke point, and a single protocol (storage REST API) to handle everything, both data and metadata.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b6WL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd94912ac-d840-4e18-996b-6b13603f4605_1305x831.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b6WL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd94912ac-d840-4e18-996b-6b13603f4605_1305x831.png 424w, https://substackcdn.com/image/fetch/$s_!b6WL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd94912ac-d840-4e18-996b-6b13603f4605_1305x831.png 848w, https://substackcdn.com/image/fetch/$s_!b6WL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd94912ac-d840-4e18-996b-6b13603f4605_1305x831.png 1272w, https://substackcdn.com/image/fetch/$s_!b6WL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd94912ac-d840-4e18-996b-6b13603f4605_1305x831.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b6WL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd94912ac-d840-4e18-996b-6b13603f4605_1305x831.png" width="1305" height="831" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d94912ac-d840-4e18-996b-6b13603f4605_1305x831.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:831,&quot;width&quot;:1305,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:455343,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/180096864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd94912ac-d840-4e18-996b-6b13603f4605_1305x831.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b6WL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd94912ac-d840-4e18-996b-6b13603f4605_1305x831.png 424w, https://substackcdn.com/image/fetch/$s_!b6WL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd94912ac-d840-4e18-996b-6b13603f4605_1305x831.png 848w, https://substackcdn.com/image/fetch/$s_!b6WL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd94912ac-d840-4e18-996b-6b13603f4605_1305x831.png 1272w, https://substackcdn.com/image/fetch/$s_!b6WL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd94912ac-d840-4e18-996b-6b13603f4605_1305x831.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>With this scalable architectural foundation, open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake have emerged as the dominant approach for managing data on cloud object stores.</p><p>Now, with DuckLake entering the scene, an obvious question arises: why build yet another lakehouse table format when the market already seems saturated? What gap could DuckLake possibly fill, and what makes it different from existing table format architectures?</p><p>To answer that, let&#8217;s first briefly recap how we got here and why major open table formats adopted their current log-oriented metadata architecture. 
</p><p>Understanding this history is essential because DuckLake&#8217;s design bears similarities to the previous-generation Apache Hive system.</p><h2>The Hadoop &amp; Apache Hive Era</h2><p>In the Hadoop era, Apache Hive (together with the Hive Metastore) served as the dominant table format on data lakes.</p><p>However, Hive suffered from two critical bottlenecks that motivated top tech companies like <strong>Uber</strong> and <strong>Netflix</strong> to develop new table formats for managing petabyte-scale data lake platforms:</p><h3>1. Slow Metadata Operations During Query Planning</h3><p>The Hive table format was a directory-oriented format that relied heavily on directory and file listing operations during the query planning phase. </p><p>The critical problem for Netflix and similar companies managing their big data platforms in the cloud was that the Hive table format was architected for Hadoop HDFS, not cloud object storage like S3.</p><p>When migrating from Hadoop HDFS storage to cloud object storage like S3, the I/O patterns change dramatically: fast local RPC calls are replaced with slower REST API calls. 
Additionally, cloud storage introduces new constraints such as per-API-call charges, API limits (e.g., 1,000 objects per list operation), and rate limits that didn&#8217;t exist in Hadoop/HDFS environments.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vKXg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98250f7b-992d-43bb-8447-d59cb7eeaf01_1787x1151.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vKXg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98250f7b-992d-43bb-8447-d59cb7eeaf01_1787x1151.png 424w, https://substackcdn.com/image/fetch/$s_!vKXg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98250f7b-992d-43bb-8447-d59cb7eeaf01_1787x1151.png 848w, https://substackcdn.com/image/fetch/$s_!vKXg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98250f7b-992d-43bb-8447-d59cb7eeaf01_1787x1151.png 1272w, https://substackcdn.com/image/fetch/$s_!vKXg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98250f7b-992d-43bb-8447-d59cb7eeaf01_1787x1151.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vKXg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98250f7b-992d-43bb-8447-d59cb7eeaf01_1787x1151.png" width="1456" height="938" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98250f7b-992d-43bb-8447-d59cb7eeaf01_1787x1151.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:938,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:503873,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/180096864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98250f7b-992d-43bb-8447-d59cb7eeaf01_1787x1151.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vKXg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98250f7b-992d-43bb-8447-d59cb7eeaf01_1787x1151.png 424w, https://substackcdn.com/image/fetch/$s_!vKXg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98250f7b-992d-43bb-8447-d59cb7eeaf01_1787x1151.png 848w, https://substackcdn.com/image/fetch/$s_!vKXg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98250f7b-992d-43bb-8447-d59cb7eeaf01_1787x1151.png 1272w, https://substackcdn.com/image/fetch/$s_!vKXg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98250f7b-992d-43bb-8447-d59cb7eeaf01_1787x1151.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>For tables with millions of data files across many partitions, query planning becomes both painfully slow and expensive. </p><p>Listing all files, issuing HEAD requests, and reading metadata sections from millions of Parquet files&#8212;each with network round-trip latencies up to 100ms&#8212;creates a significant performance bottleneck.</p><h3>2. Inefficient Transaction and Mutable Data Management</h3><p>Beyond performance issues, Hive lacked efficient transaction and concurrency management, and relied on pessimistic locking mechanisms often resulting in long lock contentions. 
</p><p>To make matters worse, other engines like Trino didn&#8217;t respect Hive&#8217;s locking mechanisms and bypassed them entirely, creating data consistency risks.</p><p>For a deeper dive into this history, I&#8217;ve previously written a comprehensive two-part series on the evolution of open table formats:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6f8caf18-9e2a-4afc-8878-6e62a7853452&quot;,&quot;caption&quot;:&quot;If you have been following trends in data engineering landscape over the past few years surely you have been hearing a lot about Open Table Formats and Data Lakehouse, if not already working with them! But what is all the hype about table formats if they have always existed and we have always been working with tables when dealing with structured data in&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The History and Evolution of Open Table Formats - Part I&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:3524999,&quot;name&quot;:&quot;Alireza Sadeghi&quot;,&quot;bio&quot;:&quot;Design, Build and Scale Data Platforms. 
Talk about #dataengineering https://www.linkedin.com/in/alirezasadeghi/&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f2c1bd7-7ad0-45b3-a325-7e369db6965b_576x576.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-08-15T06:27:28.244Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!0mek!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.pracdata.io/p/the-history-and-evolution-of-open&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:147608976,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:16,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1855168,&quot;publication_name&quot;:&quot;Practical Data Engineering&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!SGaR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46497e14-ec41-42a0-9067-72715fc9c842_848x848.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p></p><h1>Emergence of New Open Table Formats</h1><p>Facing these limitations, tech giants like Netflix and Uber recognised the need for a fundamentally different architecture. A new table format that can handle massive workloads far more efficiently. </p><p>Two key design requirements emerged to improve query planning performance:</p><p>1. <strong>Store the list of all data files in the metadata layer</strong> to avoid excessive and slow object listing calls to storage APIs during query planning.</p><p>2. 
<strong>Store data file column statistics (such as min/max values) in the metadata layer</strong> to eliminate the need to read and parse the built-in column stats from each Parquet or ORC data file at query execution time. This enables efficient file-level pruning (often referred to as predicate pushdown) during query planning.</p><p>But this raised a critical question: <strong>where should this massive volume of file listings and column statistics be stored?</strong> </p><p>For a typical petabyte-scale data lake with hundreds of millions of data files, and tables averaging around 10 columns, you&#8217;re looking at billions of metadata records for column statistics alone.</p><p>I imagine the engineers designing these table formats took a hard look at Hive Metastore&#8217;s architecture and quickly rejected the idea of storing such massive metadata volumes in a centralised database system. Even with continuous scale-up efforts, Hive Metastore was already becoming a bottleneck, and adding billions of metadata records would only make matters worse.</p><p>Combine this realisation with the proven scalability of distributed logs, and storing all metadata on object storage became the obvious choice. This approach delivers a shared-nothing metadata architecture with theoretically infinite scalability. </p><p>Similar to how data files are distributed, metadata is partitioned per dataset, allowing each table to manage its own metadata independently.</p><p>With this design, a table&#8217;s metadata can grow linearly with the data itself without ever becoming a system-level bottleneck. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FrIl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78961ece-401d-494c-940e-ed3b2d80c95b_2113x1511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FrIl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78961ece-401d-494c-940e-ed3b2d80c95b_2113x1511.png 424w, https://substackcdn.com/image/fetch/$s_!FrIl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78961ece-401d-494c-940e-ed3b2d80c95b_2113x1511.png 848w, https://substackcdn.com/image/fetch/$s_!FrIl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78961ece-401d-494c-940e-ed3b2d80c95b_2113x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!FrIl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78961ece-401d-494c-940e-ed3b2d80c95b_2113x1511.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FrIl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78961ece-401d-494c-940e-ed3b2d80c95b_2113x1511.png" width="1456" height="1041" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78961ece-401d-494c-940e-ed3b2d80c95b_2113x1511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1041,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:426566,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/180096864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78961ece-401d-494c-940e-ed3b2d80c95b_2113x1511.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FrIl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78961ece-401d-494c-940e-ed3b2d80c95b_2113x1511.png 424w, https://substackcdn.com/image/fetch/$s_!FrIl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78961ece-401d-494c-940e-ed3b2d80c95b_2113x1511.png 848w, https://substackcdn.com/image/fetch/$s_!FrIl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78961ece-401d-494c-940e-ed3b2d80c95b_2113x1511.png 1272w, https://substackcdn.com/image/fetch/$s_!FrIl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78961ece-401d-494c-940e-ed3b2d80c95b_2113x1511.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This design has made many database architects <a href="https://www.database-doctor.com/posts/iceberg-is-wrong-1.html">sceptical</a>. After all, HTTP wasn&#8217;t designed for the low-latency, high-concurrency demands of database storage, particularly for metadata management, which requires fast reads, writes, and atomic updates.</p><p>Despite this scepticism, modern open table formats have gained widespread adoption across the data community. </p><p>This widespread adoption of log-oriented formats might seem like a vindication of the architecture, but success doesn&#8217;t mean perfection. Every design involves compromises.</p><p></p><h2>The Tradeoffs</h2><p>No design comes without tradeoffs. The pain just shifts from one layer to another, and the distributed log-oriented metadata architecture is no exception. 
</p><p>But what could possibly go wrong? On paper, it all sounds great. In practice, things get complicated.</p><p>It turns out metadata has fundamentally different performance and management characteristics from data itself. While large-scale data processing optimises mainly for high throughput, metadata needs to be written, read, updated, deleted and rewritten reliably, atomically, and with low latency. </p><p>By design, object stores are not the best match for such workloads.</p><h3>Challenge #1: Metadata Amplification &amp; Operational Complexity</h3><p>A major issue with new-generation open table formats is excessive metadata file generation. </p><p>Since the underlying storage is immutable, each DML operation (insert, update, delete) requires creating one or more new metadata log files.</p><p>In Iceberg, every commit generates at least one new manifest file, a new manifest list, and a new JSON metadata file. On platforms with heavy updates, millions of tiny log files can be generated in a short period of time. </p><p>This quickly becomes counterproductive without constant housekeeping to merge and remove obsolete log files. Ironically, the original goal of log-based metadata was to eliminate excessive storage API calls during query planning, but now the problem has simply shifted to a different layer: metadata itself. </p><p>How is this handled? 
Through constant housekeeping operations: expiring older snapshots, and running background compaction jobs that continuously merge small metadata log files into larger ones and drop obsolete files.</p><h3>Challenge #2: Atomicity &amp; Concurrency Control</h3><p>While all new open table formats support read-write isolation through snapshot isolation and MVCC, atomicity at the storage layer and concurrency control for multiple writers (to the same object) have been challenging across different storage backends.</p><p>Most distributed cloud object stores lack full ACID semantics, and some historically offered only eventual consistency (AWS S3 became strongly consistent in December 2020). For instance, AWS S3 doesn&#8217;t support atomic file renames, and <strong>atomic put-if-absent</strong> (conditional writes) only became available on S3 in 2024.</p><p>Apache Iceberg relies on an atomic commit in an external catalog (such as one backed by a transactional database) to atomically update the pointer to the latest metadata file, which records the current snapshot ID. </p><p>To support concurrent writes on S3, Delta often uses a lightweight external lock (commonly DynamoDB-based) to coordinate exclusive access to the log files during commit. This handles the race condition of two writers trying to create the same log file (with the same version number).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3Wpg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F909f6a3d-df12-4313-ba73-bf90be9832b1_1478x798.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3Wpg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F909f6a3d-df12-4313-ba73-bf90be9832b1_1478x798.png 424w,
https://substackcdn.com/image/fetch/$s_!3Wpg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F909f6a3d-df12-4313-ba73-bf90be9832b1_1478x798.png 848w, https://substackcdn.com/image/fetch/$s_!3Wpg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F909f6a3d-df12-4313-ba73-bf90be9832b1_1478x798.png 1272w, https://substackcdn.com/image/fetch/$s_!3Wpg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F909f6a3d-df12-4313-ba73-bf90be9832b1_1478x798.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3Wpg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F909f6a3d-df12-4313-ba73-bf90be9832b1_1478x798.png" width="1456" height="786" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/909f6a3d-df12-4313-ba73-bf90be9832b1_1478x798.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:786,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:264548,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/180096864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F909f6a3d-df12-4313-ba73-bf90be9832b1_1478x798.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3Wpg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F909f6a3d-df12-4313-ba73-bf90be9832b1_1478x798.png 
424w, https://substackcdn.com/image/fetch/$s_!3Wpg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F909f6a3d-df12-4313-ba73-bf90be9832b1_1478x798.png 848w, https://substackcdn.com/image/fetch/$s_!3Wpg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F909f6a3d-df12-4313-ba73-bf90be9832b1_1478x798.png 1272w, https://substackcdn.com/image/fetch/$s_!3Wpg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F909f6a3d-df12-4313-ba73-bf90be9832b1_1478x798.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Apache Hudi additionally supports table-level locks through external services such as Zookeeper, Hive Metastore, or DynamoDB to coordinate concurrent writers.</p><h3>Challenge #3: Dependency on External Catalog Service</h3><p>The third major challenge is the lack of a native fully-featured catalog. This partially explains the <a href="https://www.onehouse.ai/blog/comprehensive-data-catalog-comparison">data catalog wars</a> in 2024, when major vendors competed to become the dominant catalog service for data lakehouse platforms.</p><p>While schemas are stored and evolved within the metadata layer, allowing direct table interaction without a central catalog, these formats lack the infrastructure for scalable data discovery, governance, and business-level metadata management that users and external query engines require.</p><p>The following diagram shows a typical open lakehouse architecture, with an open catalog as a key component of the architecture.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JaHK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8759cf8f-2ae7-40e9-a19c-522703f6ef37_1408x980.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JaHK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8759cf8f-2ae7-40e9-a19c-522703f6ef37_1408x980.png 424w, https://substackcdn.com/image/fetch/$s_!JaHK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8759cf8f-2ae7-40e9-a19c-522703f6ef37_1408x980.png 848w, 
https://substackcdn.com/image/fetch/$s_!JaHK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8759cf8f-2ae7-40e9-a19c-522703f6ef37_1408x980.png 1272w, https://substackcdn.com/image/fetch/$s_!JaHK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8759cf8f-2ae7-40e9-a19c-522703f6ef37_1408x980.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JaHK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8759cf8f-2ae7-40e9-a19c-522703f6ef37_1408x980.png" width="1408" height="980" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8759cf8f-2ae7-40e9-a19c-522703f6ef37_1408x980.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:980,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1315242,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/180096864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8759cf8f-2ae7-40e9-a19c-522703f6ef37_1408x980.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JaHK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8759cf8f-2ae7-40e9-a19c-522703f6ef37_1408x980.png 424w, https://substackcdn.com/image/fetch/$s_!JaHK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8759cf8f-2ae7-40e9-a19c-522703f6ef37_1408x980.png 
848w, https://substackcdn.com/image/fetch/$s_!JaHK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8759cf8f-2ae7-40e9-a19c-522703f6ef37_1408x980.png 1272w, https://substackcdn.com/image/fetch/$s_!JaHK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8759cf8f-2ae7-40e9-a19c-522703f6ef37_1408x980.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>Having outlined these challenges, it&#8217;s worth revisiting Hive Metastore to recognise its inherent strengths 
despite its weaknesses and scalability issues. </p><p>First, it was backed by transactional SQL databases such as MySQL and PostgreSQL, which made managing metadata operations in a concurrent, transaction-safe manner straightforward with full CRUD support.</p><p>Second, the Metastore served dual roles as both the metadata backend and the catalog for managing schemas. This meant users and all external query engines&#8212;Athena, Spark, Trino, Presto&#8212;could access and manage schemas directly through Hive Metastore component without requiring any other external catalog service. </p><p>These observations set the stage for understanding DuckLake&#8217;s philosophy.</p><p></p><h1>Back to DuckLake</h1><p>Now that we have the required background info let&#8217;s go back to DuckLake.</p><p>I believe DuckLake aims to achieve the best of both worlds: return to familiar, reliable, update-friendly, and transactionally safe database systems for managing metadata, while incorporating the query performance and table management features pioneered by modern open table formats.</p><p>DuckLake is built on a foundational premise: </p><div class="pullquote"><p>Relational SQL database systems remain superior for managing metadata, unlike existing open table formats that rely on immutable log-structured metadata stored on object stores.</p></div><p>The DuckLake creators observed the challenges open table formats have faced over the past five years and reached a surprising conclusion: perhaps using a centralised transactional SQL database for metadata management&#8212;as Hive did&#8212;wasn&#8217;t such a flawed approach after all.  The key is to preserve what worked while fixing what didn&#8217;t.</p><p></p><h3>What exactly is DuckLake&#8217;s critique of the log-oriented approach?</h3><p>Managing table metadata through scattered, immutable log files on object stores creates significant operational complexity. </p><p>The challenges include handling excessive numbers of small metadata files, high latency when traversing and scanning metadata logs, and managing the complete lifecycle of table snapshots, version files, data file metadata, partition information, and column statistics.
</p><p>Therefore, DuckLake proposes moving all scattered metadata structures back to a familiar, reliable centralised SQL database. But wait... haven&#8217;t we been here before? Isn&#8217;t this a step backward to the Apache Hive + Metastore architecture?</p><p>Yes, it&#8217;s a small step backward, but moving backward isn&#8217;t always a bad thing. The key difference is that DuckLake, unlike Hive, stores complete data file details in the Metastore. </p><p>Remember, one of Hive&#8217;s major performance bottlenecks was the expensive LIST calls needed to discover all data files for each query, plus the gathering of column statistics during the query planning phase. </p><p>DuckLake eliminates this bottleneck by tracking all data files and storing column-level statistics for each file in the metadata layer, just like modern open table formats. The critical difference is that these details live in relational database tables rather than scattered log files.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gRmO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac05aaee-a7f6-4091-84c9-e87a81622a83_2553x1512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gRmO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac05aaee-a7f6-4091-84c9-e87a81622a83_2553x1512.png 424w, https://substackcdn.com/image/fetch/$s_!gRmO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac05aaee-a7f6-4091-84c9-e87a81622a83_2553x1512.png 848w,
https://substackcdn.com/image/fetch/$s_!gRmO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac05aaee-a7f6-4091-84c9-e87a81622a83_2553x1512.png 1272w, https://substackcdn.com/image/fetch/$s_!gRmO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac05aaee-a7f6-4091-84c9-e87a81622a83_2553x1512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gRmO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac05aaee-a7f6-4091-84c9-e87a81622a83_2553x1512.png" width="1456" height="862" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac05aaee-a7f6-4091-84c9-e87a81622a83_2553x1512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:862,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:398120,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/180096864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac05aaee-a7f6-4091-84c9-e87a81622a83_2553x1512.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gRmO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac05aaee-a7f6-4091-84c9-e87a81622a83_2553x1512.png 424w, 
https://substackcdn.com/image/fetch/$s_!gRmO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac05aaee-a7f6-4091-84c9-e87a81622a83_2553x1512.png 848w, https://substackcdn.com/image/fetch/$s_!gRmO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac05aaee-a7f6-4091-84c9-e87a81622a83_2553x1512.png 1272w, https://substackcdn.com/image/fetch/$s_!gRmO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac05aaee-a7f6-4091-84c9-e87a81622a83_2553x1512.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Additionally, DuckLake maintains table snapshot information directly in the Metastore, simplifying snapshot history tracking and enabling snapshot isolation through MVCC.</p><p>DuckLake follows the proven approach of modern OLAP systems like BigQuery and Snowflake, which successfully manage data on object storage while maintaining all metadata in database systems.</p><p>This all sounds compelling: simpler operations, transactional guarantees, familiar SQL interfaces. But there&#8217;s an elephant in the room that we must address.</p><h2>What about Scalability?</h2><p>Here comes the pivotal question: can DuckLake scale? </p><p>After all, scalability challenges were also part of the reason why Iceberg and Hudi were built to replace the Hive table format.</p><p>A full-featured DuckLake implementation will likely struggle to scale to very large, petabyte-scale platforms without significant database tuning, vertical scaling, and potentially a distributed SQL database system (e.g. Neon). However, not everyone operates at that scale, so DuckLake will likely work well for the majority of workloads without running into metadata bottlenecks.</p><h3>The Scalability Challenge</h3><p>DuckLake stores all of its metadata in a centralised metadata database. Certain metadata tables will accumulate a substantial volume of data:</p><p>1. <code>data_file</code> table: Stores the list of Parquet data files for each table</p><p>2. <code>delete_file</code> table: Tracks the delete files for each table, i.e. the Parquet files recording which rows have been deleted from each data file</p><p>3. <code>file_column_statistics</code> table: Stores column-level statistics (min/max, null count, etc.)
for each column in each Parquet file</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lQ6q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f71f2-55a3-4e10-a2db-02129cdb4eee_698x481.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lQ6q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f71f2-55a3-4e10-a2db-02129cdb4eee_698x481.png 424w, https://substackcdn.com/image/fetch/$s_!lQ6q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f71f2-55a3-4e10-a2db-02129cdb4eee_698x481.png 848w, https://substackcdn.com/image/fetch/$s_!lQ6q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f71f2-55a3-4e10-a2db-02129cdb4eee_698x481.png 1272w, https://substackcdn.com/image/fetch/$s_!lQ6q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f71f2-55a3-4e10-a2db-02129cdb4eee_698x481.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lQ6q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f71f2-55a3-4e10-a2db-02129cdb4eee_698x481.png" width="698" height="481" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/063f71f2-55a3-4e10-a2db-02129cdb4eee_698x481.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:481,&quot;width&quot;:698,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:165551,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/180096864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f71f2-55a3-4e10-a2db-02129cdb4eee_698x481.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lQ6q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f71f2-55a3-4e10-a2db-02129cdb4eee_698x481.png 424w, https://substackcdn.com/image/fetch/$s_!lQ6q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f71f2-55a3-4e10-a2db-02129cdb4eee_698x481.png 848w, https://substackcdn.com/image/fetch/$s_!lQ6q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f71f2-55a3-4e10-a2db-02129cdb4eee_698x481.png 1272w, https://substackcdn.com/image/fetch/$s_!lQ6q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F063f71f2-55a3-4e10-a2db-02129cdb4eee_698x481.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>You can check the full table spec <a href="https://ducklake.select/docs/stable/specification/tables/overview">here</a></em></figcaption></figure></div><p>Consider a data lake with one petabyte of data&#8212;and we&#8217;re not even discussing companies managing tens or hundreds of petabytes. A typical 1 PB data lake contains approximately 50-100 million data files. 
With optimisation and reasonable file sizes, you might reduce this somewhat, but the order of magnitude remains.</p><p>Taking the upper bound of 100 million files and assuming an average of 10 columns per table, this generates:</p><ul><li><p>100 million records in the <code>data_file</code> table</p></li><li><p>1 billion records in the <code>file_column_statistics</code> table (100M files &#215; 10 columns)</p></li><li><p>Hundreds of millions of records in the <code>delete_file</code> table over time from merge and insert-overwrite workloads</p></li></ul><p>This doesn&#8217;t account for records belonging to older snapshots that have not yet been expired and compacted, which increase volumes further.</p><h3>Query-Time Implications</h3><p>Storing and writing to tables with hundreds of millions or billions of rows may not present a significant challenge, provided the database server has adequate storage and the tables aren&#8217;t heavily indexed in ways that slow down writes.</p><p>The real performance challenge lies in querying these tables. Consider querying historical data containing millions of files with many columns. For a typical large query scanning 1 million data files with 20 columns, the metadata stage would, without any filtering, have to:</p><p>1. Select <strong>1 million records</strong> from the <code>data_file</code> table</p><p>2. Join against records from the <code>delete_file</code> table to account for deleted rows</p><p>3. Scan <strong>20 million records</strong> (20 column stats records per data file) from the <code>file_column_statistics</code> table </p><p>However, we don&#8217;t need to scan and read all these records. Remember, we&#8217;re in the SQL world: we can easily filter metadata and only retrieve the candidate list of data files that must be scanned for matching records.</p><p>Here&#8217;s a sample query to retrieve the list of data files from a <em>clickstream</em> table where the query filters on the <code>event_timestamp</code> column:</p><pre><code>SELECT data_file_id
FROM file_column_stats
WHERE
  table_id = 'clickstream' AND
  column_id = 'event_timestamp' AND
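  -- Keep a file if its [min_value, max_value] range can contain the value;
  -- files with missing (NULL) statistics stay on the candidate list.
  ('timestamp-value' &gt;= min_value OR min_value IS NULL) AND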
  ('timestamp-value' &lt;= max_value OR max_value IS NULL);</code></pre><h3>DuckLake Creators&#8217; Perspective</h3><p>In a recent <a href="https://www.youtube.com/watch?v=YQEUkFWa69o&amp;list=PL9eL-xg48OM3E1AN2f40m9iv0SZEAOGKz&amp;index=7">presentation</a>, <strong>Prof. Hannes M&#252;hleisen</strong> claims that DuckLake can easily handle petabyte-scale data, citing tests conducted on a table with 100 million snapshots.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RnEk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F522b3ef8-8d35-460e-84fa-681f6b9ea8f2_1149x397.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RnEk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F522b3ef8-8d35-460e-84fa-681f6b9ea8f2_1149x397.png 424w, https://substackcdn.com/image/fetch/$s_!RnEk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F522b3ef8-8d35-460e-84fa-681f6b9ea8f2_1149x397.png 848w, https://substackcdn.com/image/fetch/$s_!RnEk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F522b3ef8-8d35-460e-84fa-681f6b9ea8f2_1149x397.png 1272w, https://substackcdn.com/image/fetch/$s_!RnEk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F522b3ef8-8d35-460e-84fa-681f6b9ea8f2_1149x397.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!RnEk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F522b3ef8-8d35-460e-84fa-681f6b9ea8f2_1149x397.png" width="1149" height="397" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/522b3ef8-8d35-460e-84fa-681f6b9ea8f2_1149x397.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:397,&quot;width&quot;:1149,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:142070,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/180096864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F522b3ef8-8d35-460e-84fa-681f6b9ea8f2_1149x397.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RnEk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F522b3ef8-8d35-460e-84fa-681f6b9ea8f2_1149x397.png 424w, https://substackcdn.com/image/fetch/$s_!RnEk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F522b3ef8-8d35-460e-84fa-681f6b9ea8f2_1149x397.png 848w, https://substackcdn.com/image/fetch/$s_!RnEk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F522b3ef8-8d35-460e-84fa-681f6b9ea8f2_1149x397.png 1272w, https://substackcdn.com/image/fetch/$s_!RnEk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F522b3ef8-8d35-460e-84fa-681f6b9ea8f2_1149x397.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p><strong>Jordan Tigani</strong> <a href="https://www.youtube.com/watch?v=z2GhznqtIz0">draws parallels</a> between DuckLake&#8217;s column statistics storage and <strong>BigQuery&#8217;s CMETA</strong> metadata table, which serves a similar purpose. However, according to BigQuery&#8217;s published <a href="https://research.google/pubs/big-metadata-when-metadata-is-big-data/">research paper</a> on the subject, BigQuery stores column-level statistics separately for each individual table rather than in a single central metadata table. 
</p><p>This means that, much as existing lakehouse formats manage metadata independently per dataset, each user table (e.g., <code>sales</code>) maps one-to-one to its own column statistics table (e.g., <code>cmeta_sales</code>). This design avoids a single, potentially huge central metadata table and enables horizontal scalability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6yCK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2d0f72-d52b-444d-813a-0d15fe5500e9_1300x545.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6yCK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2d0f72-d52b-444d-813a-0d15fe5500e9_1300x545.png 424w, https://substackcdn.com/image/fetch/$s_!6yCK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2d0f72-d52b-444d-813a-0d15fe5500e9_1300x545.png 848w, https://substackcdn.com/image/fetch/$s_!6yCK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2d0f72-d52b-444d-813a-0d15fe5500e9_1300x545.png 1272w, https://substackcdn.com/image/fetch/$s_!6yCK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2d0f72-d52b-444d-813a-0d15fe5500e9_1300x545.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6yCK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2d0f72-d52b-444d-813a-0d15fe5500e9_1300x545.png" width="1300" height="545" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd2d0f72-d52b-444d-813a-0d15fe5500e9_1300x545.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:545,&quot;width&quot;:1300,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:212187,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/180096864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2d0f72-d52b-444d-813a-0d15fe5500e9_1300x545.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6yCK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2d0f72-d52b-444d-813a-0d15fe5500e9_1300x545.png 424w, https://substackcdn.com/image/fetch/$s_!6yCK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2d0f72-d52b-444d-813a-0d15fe5500e9_1300x545.png 848w, https://substackcdn.com/image/fetch/$s_!6yCK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2d0f72-d52b-444d-813a-0d15fe5500e9_1300x545.png 1272w, https://substackcdn.com/image/fetch/$s_!6yCK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd2d0f72-d52b-444d-813a-0d15fe5500e9_1300x545.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Reference: <a href="https://cloud.google.com/blog/products/data-analytics/understanding-the-bigquery--column-metadata-cmeta-index/">link</a></figcaption></figure></div><p>Similarly, <strong>Snowflake</strong> stores such metadata on its highly optimised distributed metadata layer, which appears to be built on the <strong>FoundationDB</strong> distributed key-value store.</p><p>As Jordan emphasises, nothing inherently prevents a DuckLake implementation from using a similar scalable and distributed storage engine for the metadata layer. </p><p>A vendor with sufficient expertise and resources&#8212;such as <strong>MotherDuck</strong> itself&#8212;could reasonably manage such infrastructure, much as Google and Snowflake do. 
However, for self-hosted deployments, implementing and operating such a system would introduce significant operational complexity and cost.</p><h3>The Bottom Line</h3><p>Can the backend metadata server (e.g., PostgreSQL) efficiently manage tables with billions of records and handle many concurrent readers and writers without becoming a performance bottleneck during metadata retrieval and query planning?</p><p>This remains to be battle-tested. With a well-tuned database on adequately sized hardware and efficient indexing, clustering, or partitioning (or even a distributed database backend), things could work out. The important thing is to recognise and weigh the trade-offs between the two metadata architectural patterns.</p><div><hr></div><p>Having explored both architectural approaches in depth, let&#8217;s consolidate our understanding with a high-level comparison of key features and design choices.</p><h1>High-level Feature Comparison</h1><p>While this blog post is not a feature comparison of open table formats, here is a high-level comparison of some key features and architectural differences between DuckLake and other table formats:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xqBE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822365ba-6739-44a6-ab45-323d7ceb2d61_2184x994.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xqBE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822365ba-6739-44a6-ab45-323d7ceb2d61_2184x994.png 424w, 
https://substackcdn.com/image/fetch/$s_!xqBE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822365ba-6739-44a6-ab45-323d7ceb2d61_2184x994.png 848w, https://substackcdn.com/image/fetch/$s_!xqBE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822365ba-6739-44a6-ab45-323d7ceb2d61_2184x994.png 1272w, https://substackcdn.com/image/fetch/$s_!xqBE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822365ba-6739-44a6-ab45-323d7ceb2d61_2184x994.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xqBE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822365ba-6739-44a6-ab45-323d7ceb2d61_2184x994.png" width="1456" height="663" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/822365ba-6739-44a6-ab45-323d7ceb2d61_2184x994.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:663,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:319644,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/180096864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822365ba-6739-44a6-ab45-323d7ceb2d61_2184x994.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xqBE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822365ba-6739-44a6-ab45-323d7ceb2d61_2184x994.png 
424w, https://substackcdn.com/image/fetch/$s_!xqBE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822365ba-6739-44a6-ab45-323d7ceb2d61_2184x994.png 848w, https://substackcdn.com/image/fetch/$s_!xqBE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822365ba-6739-44a6-ab45-323d7ceb2d61_2184x994.png 1272w, https://substackcdn.com/image/fetch/$s_!xqBE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822365ba-6739-44a6-ab45-323d7ceb2d61_2184x994.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As highlighted in the table above, DuckLake shares significant architectural similarities with Apache Hive while incorporating modern optimisations that address Hive&#8217;s shortcomings&#8212;specifically data file tracking, column-level statistics, improved concurrency management, and snapshot tracking.</p><p>With this technical foundation established, we can now address the central question: what does DuckLake&#8217;s future hold?</p><h1>DuckLake&#8217;s Path Forward</h1><p>It&#8217;s still too early to predict DuckLake&#8217;s fate. Just months after its launch, it faces fierce competition in a lakehouse market already dominated by established open table formats. The road ahead is challenging if DuckLake hopes to gain meaningful traction.</p><p>My personal assessment is that DuckLake has genuine potential, but not necessarily for large tech companies operating at massive scale. And that&#8217;s perfectly fine. </p><p>While MotherDuck will offer managed DuckLake at cloud scale, the open-source version could find strong adoption among small-to-medium self-hosted data lakes that never approach the metadata bottleneck of billions of records. After all, far more organisations manage small to moderate data volumes (&lt; 100 TB) than operate at extreme scale.</p><p>Since DuckDB has already gained significant traction with strong ecosystem integration and support, DuckLake as a DuckDB extension can leverage this momentum. </p><p>Together, they could provide a lightweight <strong>Hybrid Analytical Lakehouse Storage (HALS)</strong> solution. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qnPB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a482b2-0b6f-4db3-9f77-58a66668358b_1672x1187.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qnPB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a482b2-0b6f-4db3-9f77-58a66668358b_1672x1187.png 424w, https://substackcdn.com/image/fetch/$s_!qnPB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a482b2-0b6f-4db3-9f77-58a66668358b_1672x1187.png 848w, https://substackcdn.com/image/fetch/$s_!qnPB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a482b2-0b6f-4db3-9f77-58a66668358b_1672x1187.png 1272w, https://substackcdn.com/image/fetch/$s_!qnPB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a482b2-0b6f-4db3-9f77-58a66668358b_1672x1187.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qnPB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a482b2-0b6f-4db3-9f77-58a66668358b_1672x1187.png" width="1456" height="1034" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44a482b2-0b6f-4db3-9f77-58a66668358b_1672x1187.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1034,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:930074,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/180096864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a482b2-0b6f-4db3-9f77-58a66668358b_1672x1187.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qnPB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a482b2-0b6f-4db3-9f77-58a66668358b_1672x1187.png 424w, https://substackcdn.com/image/fetch/$s_!qnPB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a482b2-0b6f-4db3-9f77-58a66668358b_1672x1187.png 848w, https://substackcdn.com/image/fetch/$s_!qnPB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a482b2-0b6f-4db3-9f77-58a66668358b_1672x1187.png 1272w, https://substackcdn.com/image/fetch/$s_!qnPB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a482b2-0b6f-4db3-9f77-58a66668358b_1672x1187.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This represents a simple, compelling, unified analytics solution for anyone already using DuckDB in production who wants to extend it to managing data in a data lake.</p><p>With the rise of the <a href="https://www.pracdata.io/p/the-rise-of-single-node-processing">single-node processing paradigm</a>, this solution extends that approach to unstructured and semi-structured data processing use cases, where storing data as Parquet files on object storage could deliver cost and performance benefits.</p><h2>The Ecosystem Challenge</h2><p>Beyond DuckDB&#8217;s territory, DuckLake&#8217;s success, particularly as a portable, open-source table format, hinges entirely on community adoption. 
</p><p>So far, in the months since its introduction, no significant contributions have emerged from outside DuckDB-related entities (DuckDB Labs, MotherDuck).</p><p>It&#8217;s important to note that, like Apache Iceberg, DuckLake is primarily a specification, not an engine. Just as Iceberg has implementations in Spark, Trino, and other engines, DuckLake can be implemented by any engine based on its published specification. </p><p>To reach its full potential, the DuckLake spec requires broad adoption and integration throughout the data ecosystem. This includes support in popular data libraries used from Python, like Polars, DataFusion, and Arrow; connectors for distributed processing engines such as Spark, Trino, and Flink; and integration with data integration and transformation tools like Fivetran, AWS Glue, dbt, and Airbyte. </p><p>Besides tooling, DuckLake requires genuine community interest and incentive to keep pace with the specification&#8217;s evolution as it matures over the coming years. </p><p>Without this backing, DuckLake risks remaining confined to the DuckDB ecosystem, limited to the DuckDB extension or MotherDuck&#8217;s managed cloud platform. 
</p><p>Whether the broader data community sees enough value to invest in DuckLake&#8217;s future remains the defining question.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DS7n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" width="220" height="218.42293906810036" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:558,&quot;resizeWidth&quot;:220,&quot;bytes&quot;:136646,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/in/alirezasadeghi/&quot;,&quot;text&quot;:&quot;Follow Me on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/in/alirezasadeghi/"><span>Follow 
Me on LinkedIn</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[DLD #5 | Data Landscape Digest 🗞️]]></title><description><![CDATA[S3 Table Abstraction, Airflow's date intervals, Confluent TableFlow, DuckDB Lakehouse analytics, Google Next Keynotes and more.]]></description><link>https://www.pracdata.io/p/dld-5-data-landscape-digest</link><guid isPermaLink="false">https://www.pracdata.io/p/dld-5-data-landscape-digest</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Mon, 21 Apr 2025 11:48:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elGK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!elGK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!elGK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!elGK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!elGK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!elGK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!elGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg" width="1456" height="1020" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182213,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!elGK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!elGK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!elGK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!elGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div><hr></div><h1>&#10024; Featured - Table Buckets as a New Lakehouse Abstraction</h1><div><hr></div><p>Since AWS announced <strong>S3 Tables</strong> last December, many industry experts have shared their opinions through blogs and social media. Most of the feedback I've read so far has been somewhat negative. 
Critics like <strong><a href="https://dataengineeringcentral.substack.com/p/aws-s3-tables-the-iceberg-cometh">Daniel</a></strong> argue that S3 Tables create vendor lock-in. They see them as a proprietary solution that limits the flexibility and openness required by true Open Lakehouse architectures.</p><p>Critics also highlight that S3 Tables mainly integrate with AWS services such as <strong>Glue</strong>, <strong>Athena</strong>, and <strong>EMR</strong>, lacking sufficient support for general-purpose tools and services.</p><p>Despite these criticisms, broader support for S3 Tables is emerging. <strong>PyIceberg</strong> now supports accessing S3 Tables through the Glue Catalog, and <strong>DuckDB</strong> has added <a href="https://aws.amazon.com/blogs/storage/streamlining-access-to-tabular-datasets-stored-in-amazon-s3-tables-with-duckdb/">preview support</a> for querying S3 Tables through Iceberg REST API endpoints, such as Amazon SageMaker Lakehouse REST endpoints.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lfmz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lfmz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 424w, https://substackcdn.com/image/fetch/$s_!lfmz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 848w, 
https://substackcdn.com/image/fetch/$s_!lfmz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 1272w, https://substackcdn.com/image/fetch/$s_!lfmz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lfmz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png" width="1108" height="324" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:324,&quot;width&quot;:1108,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42453,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/161791484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lfmz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 424w, https://substackcdn.com/image/fetch/$s_!lfmz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 
848w, https://substackcdn.com/image/fetch/$s_!lfmz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 1272w, https://substackcdn.com/image/fetch/$s_!lfmz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>In this context, <strong>Werner Vogels</strong>, Amazon's CTO, recently published another excellent article on his All Things Distributed blog. 
He reflects on the 19-year evolution of Amazon S3, from simple object storage to a comprehensive data solution. Vogels emphasises how customer feedback has shaped the development of S3's major capabilities.</p><p>Vogels further explains that S3's latest innovation, S3 Tables, directly addresses customer feedback and the pain points customers experienced when working with open table formats on S3 object storage. Its main goal is to enhance lakehouse data management by resolving these challenges. Vogels also discusses the <strong>ongoing tension between simplicity and velocity</strong>, a common trade-off in product development. I highly recommend reading his article.</p><p><strong><a href="https://www.allthingsdistributed.com/2025/03/in-s3-simplicity-is-table-stakes.html">Read the full article here</a></strong></p><div><hr></div><h1>&#128225; Open Source News</h1><div><hr></div><h3>&#128073; Apache Flink 2.0.0 Release</h3><p><strong>Apache Flink 2.0.0</strong> has been officially released, marking the first major Flink release since version 1.0 launched nine years ago. Key innovations include disaggregated state management, materialised tables for unified stream-batch processing, and enhanced SQL capabilities. See the release announcement for the full list of new features and improvements. <strong><a href="https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/">--&gt; Read More</a></strong></p><h3>&#128073; Apache Kafka 4.0 Release</h3><p><strong>Apache Kafka 4.0</strong> is officially out, marking a significant milestone by removing the dependency on Apache ZooKeeper. Key features highlighted by the Confluent blog include a new consumer group protocol for improved rebalance performance, the introduction of Queues for Kafka to support traditional queue semantics, updated Java version requirements, and various enhancements through Kafka Improvement Proposals (KIPs). 
<strong><a href="https://www.confluent.io/blog/latest-apache-kafka-release/">--&gt; Read More</a></strong></p><h3>&#128073; DuckDB Now Has Its Own Web UI!</h3><p>Starting with <strong>DuckDB v1.2.1</strong>, DuckDB includes a local web user interface (UI), developed in collaboration with <strong>MotherDuck</strong> and aimed at simplifying database interactions. With the new UI you can execute SQL queries in interactive notebooks and explore databases and columns, with features like syntax highlighting and autocomplete. <strong><a href="https://delta.io/blog/delta-lake-optimize/">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1> &#128736; Practical Data Engineering</h1><div><hr></div><h3>&#128073;  Implementation of Medallion Architecture Using ClickHouse</h3><p>A great hands-on example of implementing a <strong>medallion architecture</strong> using <strong>ClickHouse</strong> to process and analyse free data from the Bluesky social network. It highlights common challenges such as data duplication, malformed JSON, and inconsistent structures. 
It also demonstrates how ClickHouse features&#8212;including the JSON data type and the ReplacingMergeTree engine&#8212;effectively address these issues.  <strong><a href="https://clickhouse.com/blog/building-a-medallion-architecture-for-bluesky-json-data-with-clickhouse">--&gt; Read More</a></strong></p><h3>&#128073; DuckDB&#8217;s Enhanced Lakehouse Analytics</h3><p>Querying external Delta Lake tables from DuckDB has become significantly easier with recent improvements in the Delta extension. You can now attach to Delta tables and query them using aliases in a simpler and cleaner manner. Additionally, data-skipping enhancements accelerate federated queries over open table formats, demonstrating DuckDB's commitment to advance federated analytics capabilities over lakehouse table formats. <strong><a href="https://duckdb.org/2025/03/21/maximizing-your-delta-scan-performance.html">--&gt; Read More</a></strong></p><h3>&#128073; 21 Reasons to Consider Apache Hudi!</h3><p>While <strong>Apache Iceberg</strong> has emerged as the leading open table format and continues strong into 2025, <strong>Apache Hudi</strong> is making its case by highlighting 21 unique reasons to consider Hudi over Iceberg and Delta Lake. If you're currently evaluating open table formats for your next project or company, this comparison is worth checking out. <strong><a href="https://hudi.apache.org/blog/2025/03/05/hudi-21-unique-differentiators/">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#9881;&#65039; Technical Deep Dive</h1><div><hr></div><h3>&#128073; Parquet Data Skipping Mechanisms</h3><p>This concise article provides an excellent overview of how pruning and data skipping are typically performed on <strong>Parquet</strong> files using metadata and statistics at various levels, including row groups and pages. 
The described approach aligns with <strong>DataFusion</strong>'s implementation of the Parquet reading and pruning pipeline, but the general design principles apply broadly to other Parquet readers as well. <strong><a href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/">--&gt; Read More</a></strong></p><h3>&#128073; What's the Difference Between Arrow Flight, ADBC, and Arrow IPC? </h3><p>Still confused about the differences between Arrow Flight Protocol, ADBC, and Arrow IPC? This article clearly explains these key technologies within the <strong>Apache Arrow</strong> ecosystem. It also highlights the advantages of using Apache Arrow as a data interchange format, demonstrating how it enables faster and more efficient data exchange compared to traditional methods. <strong><a href="https://arrow.apache.org/blog/2025/02/28/data-wants-to-be-free/">--&gt; Read More</a></strong></p><h3>&#128073; Airflow Data Intervals: A Deep Dive</h3><p>Understanding <strong>Apache Airflow</strong>'s concept of time and intervals&#8212;such as start time, execution time, and logical dates&#8212;can be challenging. This article dives deep into the significance of data intervals in Airflow, clearly explaining their critical role in effectively scheduling and executing workflows. It illustrates how data intervals ensure each DAG run processes complete and accurate datasets, thus promoting idempotency and enabling reliable backfilling of historical data. <strong><a href="https://towardsdatascience.com/airflow-data-intervals-a-deep-dive-15d0ccfb0661/">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#128195; Academic Papers</h1><div><hr></div><h3>&#128073; OLAP DBMS Archetype For The Next Ten Years?  </h3><p>An insightful paper co-authored by Michael Stonebraker and Andy Pavlo provides a comprehensive overview of the evolution of database management systems (DBMSs) since 2005. 
The authors predict the <strong>lakehouse architecture</strong> as the "<strong>OLAP DBMS archetype</strong>" for the coming decade, offering a unified infrastructure capable of supporting both SQL and non-SQL workloads&#8212;a vision initially conceived but not fully realised during the Hadoop era. <strong><a href="https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1> &#128188; Data Engineering Career</h1><div><hr></div><h3>&#128073; How Will AI Disrupt Data Engineering?  </h3><p>This is one of the most compelling questions facing software and data engineers today. <strong>Tristan Handy</strong>, founder and CEO of dbt Labs, shares insightful perspectives on how artificial intelligence will impact data engineering roles and what the future might hold for data engineers. <strong><a href="https://roundup.getdbt.com/p/how-ai-will-disrupt-data-engineering">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#128170; Skill Up</h1><div><hr></div><h3>&#128073; Learning ClickHouse Fundamentals</h3><p>ClickHouse is offering free 3-hour training sessions on ClickHouse fundamentals scheduled for April 22 and May 15. If you're interested, be sure to sign up! <strong><a href="https://clickhouse.com/company/events/clickhouse-fundamentals">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1> &#128227; Vendors News &amp; Announcements</h1><div><hr></div><h3>&#128073; General Availability of Confluent's TableFlow</h3><p><strong>Confluent</strong> has announced the general availability of <strong>TableFlow</strong>, a technology that embraces stream-table duality. TableFlow simplifies the ingestion of event data from Kafka topics into structured lakehouse tables (currently supporting Iceberg and Delta Lake), abstracting away complex data engineering tasks. It manages the entire ingestion lifecycle, including schema evolution and table maintenance on Confluent Platform. 
<strong><a href="https://www.confluent.io/blog/latest-tableflow/">--&gt; Read More</a></strong></p><h3>&#128073; Materialize Self-Managed &amp; Free Community Edition</h3><p><strong>Materialize</strong>, the streaming database service provider, has introduced two new offerings: a Self-Managed version and a Free Community Edition. The Self-Managed option allows deployment of Materialize in your own environment, providing greater control over performance and compliance. The Community Edition grants free access to Materialize's features within specific usage limits, making it easier to test or run small-scale production workloads. <strong><a href="https://materialize.com/blog/materialize-for-everyone/">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1> &#128270; Case Studies</h1><div><hr></div><h3>&#128073; Streaming Data Ingestion Into Cloud Data Warehouse  </h3><p>This insightful case study by <strong>Canva</strong> compares different approaches to ingesting data into cloud data warehouses. It contrasts their previous micro-batch, file-based ingestion method (using AWS services like Firehose) with a new architecture using Snowflake's managed <strong>Snowpipe</strong> ingestion service. The new solution achieves low-latency ingestion of billions of events daily, significantly reducing query latency to under 10 minutes and lowering overall cloud costs. <strong><a href="https://www.canva.dev/blog/engineering/snowpipe-streaming/">--&gt; Read More</a></strong></p><h3>&#128073; Using Key-Value Stores for Exactly-Once Streaming Ingestion  </h3><p>Event deduplication and exactly-once processing guarantees are crucial challenges in building reliable streaming pipelines, often complicated by pipeline failures and network issues. One popular solution involves using an external, fast key-value store to track and eliminate duplicate events during streaming. 
<strong>MyHeritage</strong> shares their experience implementing exactly-once processing using Spark Structured Streaming and a key-value store for deduplication. <strong><a href="https://medium.com/myheritage-engineering/exactly-once-processing-in-spark-structured-streaming-39eb5ffcaa27">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#127909; Conferences &amp; Events</h1><div><hr></div><h3>&#128073; Google Cloud Next '25 Keynotes  </h3><p><strong>Google Cloud Next '25</strong> took place earlier this month, and both the opening keynote and developer keynote are now available on YouTube. The conference again focused heavily on AI, highlighting Google's Gemini LLM foundation model, AI agents, the Agent Development Kit, and new features in Vertex AI. On the analytics front, the new capabilities of Google's Data Science Agent look particularly exciting. Be sure to check out these keynotes to stay updated on the latest trends and technologies from Google Cloud. </p><p><strong>Opening Keynote:</strong></p><div id="youtube2-Md4Fs-Zc3tg" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;Md4Fs-Zc3tg&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/Md4Fs-Zc3tg?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p><strong>Developer Keynote:</strong></p><div id="youtube2-xLDSuXD8Mls" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;xLDSuXD8Mls&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/xLDSuXD8Mls?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; 
fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DS7n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" width="220" height="218.42293906810036" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:558,&quot;resizeWidth&quot;:220,&quot;bytes&quot;:136646,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/in/alirezasadeghi/&quot;,&quot;text&quot;:&quot;Follow Me on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/in/alirezasadeghi/"><span>Follow Me on 
LinkedIn</span></a></p>]]></content:encoded></item><item><title><![CDATA[Open Source Data Engineering Landscape 2025]]></title><description><![CDATA[A comprehensive view of active open source tools and emerging trends in data engineering ecosystem in 2024-2025]]></description><link>https://www.pracdata.io/p/open-source-data-engineering-landscape-2025</link><guid isPermaLink="false">https://www.pracdata.io/p/open-source-data-engineering-landscape-2025</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Tue, 11 Feb 2025 08:04:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1440321e-63dc-4dc7-8db3-27adbf1937ad_4542x3522.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>The open source data engineering landscape continues to evolve rapidly, with significant developments across storage, processing, integration, and analytics in 2024. </p><p>This marks the second year the <strong>open source data engineering landscape</strong> is published. The goal is to identify and showcase key active projects and prominent tools in the data engineering space, and provide a comprehensive overview of the dynamic data engineering ecosystem, key trends and developments.</p><p>While this landscape is published annually, the accompanying <strong><a href="https://github.com/pracdata/awesome-open-source-data-engineering">GitHub repository</a></strong> is updated regularly throughout the year. Feel free to contribute if you notice any missing component.</p><h2>Research Methodology</h2><p>Conducting such extensive research demands considerable effort and time. 
I continuously research and strive to stay informed about significant developments in the data engineering ecosystem throughout the year, including news, activities, trends, reports, and advancements.</p><p>Last year, I built my own <em><strong>little data platform</strong></em> to track GitHub public repository events, enabling better analysis of GitHub-related metrics of open source tools such as code activity, stars, user engagement, and issue resolution.</p><p>The stack includes a data lake (S3), Parquet as the serialisation format, DuckDB for processing and analytics, Apache NiFi for data integration, Apache Superset for visualisation, and PostgreSQL for metadata management, among other tools. This setup has allowed me to collect approximately 1 TB of raw GitHub event data, consisting of billions of records, along with a daily rolled-up aggregate dataset totalling over 500 million records for 2024.</p><h2>Tool Selection Criteria</h2><p>The number of open source projects available in each category is vast, making it impractical to include every tool and project in the presented landscape. </p><p>While the <strong><a href="https://github.com/pracdata/awesome-open-source-data-engineering">GitHub page</a></strong> contains a more comprehensive list of tools, the annually published landscape only contains active projects, excluding inactive projects and fairly new ones that lack minimal maturity or traction. 
However not all included tools may be fully <em>production-ready</em>; some are still on their journey toward maturity.</p><p>Without further ado, here is the <strong>2025 Open Source Data Engineering Landscape</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0JeO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1440321e-63dc-4dc7-8db3-27adbf1937ad_4542x3522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0JeO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1440321e-63dc-4dc7-8db3-27adbf1937ad_4542x3522.png 424w, https://substackcdn.com/image/fetch/$s_!0JeO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1440321e-63dc-4dc7-8db3-27adbf1937ad_4542x3522.png 848w, https://substackcdn.com/image/fetch/$s_!0JeO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1440321e-63dc-4dc7-8db3-27adbf1937ad_4542x3522.png 1272w, https://substackcdn.com/image/fetch/$s_!0JeO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1440321e-63dc-4dc7-8db3-27adbf1937ad_4542x3522.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0JeO!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1440321e-63dc-4dc7-8db3-27adbf1937ad_4542x3522.png" width="1200" height="930.4945054945055" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1440321e-63dc-4dc7-8db3-27adbf1937ad_4542x3522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1129,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:2574422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0JeO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1440321e-63dc-4dc7-8db3-27adbf1937ad_4542x3522.png 424w, https://substackcdn.com/image/fetch/$s_!0JeO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1440321e-63dc-4dc7-8db3-27adbf1937ad_4542x3522.png 848w, https://substackcdn.com/image/fetch/$s_!0JeO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1440321e-63dc-4dc7-8db3-27adbf1937ad_4542x3522.png 1272w, https://substackcdn.com/image/fetch/$s_!0JeO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1440321e-63dc-4dc7-8db3-27adbf1937ad_4542x3522.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Open Source Data Engineering Landscape 2025</figcaption></figure></div><p></p><h1>State of Open Source in 2025</h1><p>The open source data engineering ecosystem experienced substantial growth in 2024, with <strong>over 50 new tools</strong> added to this year's landscape and approximately 10 inactive or archived projects removed. Although not all of these tools launched in 2024, they represent important additions to the ecosystem.</p><p>While this growth demonstrates continued innovation, the year also saw some concerning developments regarding licensing changes. Established projects including <strong>Redis</strong>, <strong>CockroachDB</strong>, <strong>Elasticsearch</strong>, and <strong>Kibana</strong> transitioned to more closed and proprietary licenses, though Elastic later announced a return to open source licensing.</p><p>However, these shifts were balanced by significant contributions to the open source community from major industry players. 
Snowflake's contribution of <strong>Polaris</strong>, Databricks' open sourcing of <strong>Unity Catalog</strong>, OneHouse's donation of <strong>Apache XTable</strong>, and Netflix's release of <strong>Maestro</strong> demonstrated ongoing commitment to open source development from industry leaders.</p><p>The <strong>Apache Foundation</strong> maintained its position as a key steward of data technologies, actively incubating several promising projects throughout 2024. Notable projects in incubation included <strong>Apache XTable</strong> (universal table format), <strong>Apache Amoro </strong>(lakehouse management), <strong>Apache HoraeDB</strong> (time-series database), <strong>Apache Gravitino </strong>(data catalog), <strong>Apache Gluten </strong>(Middleware), and <strong>Apache Polaris </strong>(data catalog). </p><p>The <strong>Linux Foundation</strong> has also strengthened its position in the data space, continuing to host exceptional projects such as <strong>Delta Lake</strong>, <strong>Amundsen</strong>, <strong>Kedro</strong>, <strong>Milvus</strong>, and <strong>Marquez</strong>. The foundation expanded its portfolio in 2024 with new significant additions, including <strong>vLLM</strong>, donated by the University of California, Berkeley, and <strong>OpenSearch</strong>, which was transferred from AWS to the Linux Foundation.</p><h2>Open Source vs Open Core vs Open Foundation</h2><p>Not all of the projects listed are fully <em><strong>interoperable</strong></em>, <em><strong>vendor-neutral</strong></em> open source tools. Some operate under an <strong><a href="https://opensource.com/article/21/11/open-core-vs-open-source">open core</a> </strong>model, where not all components of the complete system are available in the open source version. Typically, critical features such as security, governance, and monitoring are reserved for the paid versions.</p><p>Questions remain about the sustainability of the open core business model. 
This model faces significant challenges, leading some to believe it may give way to the <strong><a href="https://thenewstack.io/rip-open-core-long-live-open-source/">Open Foundation</a></strong> model. In this approach, open source software serves as the backbone of commercial offerings, ensuring that it remains a fully viable product for production with all the necessary features.</p><p></p><h1>Overview of Categories</h1><p>The data engineering landscape is divided into 9 major categories:</p><ol><li><p><strong>Storage Systems:</strong> Databases and storage engines spanning OLTP, OLAP, and specialised storage solutions.</p></li><li><p><strong>Data Lake Platform:</strong> Tools and frameworks for building and managing data lakes and lakehouses.</p></li><li><p><strong>Data Processing &amp; Integration:</strong> Frameworks for batch and stream processing, plus Python data processing tools.</p></li><li><p><strong>Workflow Orchestration &amp; DataOps:</strong> Tools for orchestrating data pipelines and managing data operations.</p></li><li><p><strong>Data Integration:</strong> Solutions for data ingestion, CDC (Change Data Capture), and integration between systems.</p></li><li><p><strong>Data Infrastructure:</strong> Core infrastructure components including container orchestration and monitoring.</p></li><li><p><strong>ML/AI Platform:</strong> Tools focused on ML platforms, MLOps and vector databases.</p></li><li><p><strong>Metadata Management:</strong> Solutions for data catalogs, governance, and metadata management.</p></li><li><p><strong>Analytics &amp; Visualisation:</strong> BI tools, visualisation frameworks, and analytics engines.</p></li></ol><p>The following sections briefly discuss the latest trends, innovations, and current state of the major products in each category.</p><h2>1. Storage Systems</h2><p>The storage systems landscape has seen significant architectural advancements in 2024, particularly in the realm of <strong>OLAP</strong> database systems. 
</p><p><strong><a href="https://www.pracdata.io/p/duckdb-beyond-the-hype">DuckDB</a></strong> emerged as a major success story, particularly following its 1.0 release that demonstrated production readiness for enterprise use. The new <strong>embeddable OLAP category</strong> has expanded with new entrants like <strong>chDB</strong> (built on ClickHouse), <strong>GlareDB</strong>, and <strong>SlateDB</strong>, reflecting growing demand for lightweight analytical processing capabilities.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B-jn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1afc5b72-f240-4f05-bf0b-09e1407bbd48_3986x2138.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B-jn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1afc5b72-f240-4f05-bf0b-09e1407bbd48_3986x2138.png 424w, https://substackcdn.com/image/fetch/$s_!B-jn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1afc5b72-f240-4f05-bf0b-09e1407bbd48_3986x2138.png 848w, https://substackcdn.com/image/fetch/$s_!B-jn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1afc5b72-f240-4f05-bf0b-09e1407bbd48_3986x2138.png 1272w, https://substackcdn.com/image/fetch/$s_!B-jn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1afc5b72-f240-4f05-bf0b-09e1407bbd48_3986x2138.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!B-jn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1afc5b72-f240-4f05-bf0b-09e1407bbd48_3986x2138.png" width="1456" height="781" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1afc5b72-f240-4f05-bf0b-09e1407bbd48_3986x2138.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:99734,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B-jn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1afc5b72-f240-4f05-bf0b-09e1407bbd48_3986x2138.png 424w, https://substackcdn.com/image/fetch/$s_!B-jn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1afc5b72-f240-4f05-bf0b-09e1407bbd48_3986x2138.png 848w, https://substackcdn.com/image/fetch/$s_!B-jn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1afc5b72-f240-4f05-bf0b-09e1407bbd48_3986x2138.png 1272w, https://substackcdn.com/image/fetch/$s_!B-jn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1afc5b72-f240-4f05-bf0b-09e1407bbd48_3986x2138.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3>OLAP Extensions &amp; HTAS</h3><p>A significant development has been the proliferation of new OLAP extensions, especially in the PostgreSQL ecosystem. 
</p><p>These extensions make it possible to seamlessly extend OLTP databases, transforming them into <strong>HTAP</strong> (Hybrid Transactional/Analytical Processing) or the newer <strong><a href="https://jack-vanlightly.com/blog/2024/5/2/hybrid-transactional-analytical-storage">HTAS (Hybrid Transactional Analytical Storage)</a></strong> database engines, which integrate headless data storage, such as data lakes and lakehouses, with transactional database systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l0nE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42684118-9df6-4dd9-a5b2-b06415e5c86b_2869x1148.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l0nE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42684118-9df6-4dd9-a5b2-b06415e5c86b_2869x1148.png 424w, https://substackcdn.com/image/fetch/$s_!l0nE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42684118-9df6-4dd9-a5b2-b06415e5c86b_2869x1148.png 848w, https://substackcdn.com/image/fetch/$s_!l0nE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42684118-9df6-4dd9-a5b2-b06415e5c86b_2869x1148.png 1272w, https://substackcdn.com/image/fetch/$s_!l0nE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42684118-9df6-4dd9-a5b2-b06415e5c86b_2869x1148.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!l0nE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42684118-9df6-4dd9-a5b2-b06415e5c86b_2869x1148.png" width="1456" height="583" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42684118-9df6-4dd9-a5b2-b06415e5c86b_2869x1148.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:583,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1367776,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l0nE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42684118-9df6-4dd9-a5b2-b06415e5c86b_2869x1148.png 424w, https://substackcdn.com/image/fetch/$s_!l0nE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42684118-9df6-4dd9-a5b2-b06415e5c86b_2869x1148.png 848w, https://substackcdn.com/image/fetch/$s_!l0nE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42684118-9df6-4dd9-a5b2-b06415e5c86b_2869x1148.png 1272w, https://substackcdn.com/image/fetch/$s_!l0nE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42684118-9df6-4dd9-a5b2-b06415e5c86b_2869x1148.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>MotherDuck's release of <strong><a href="https://github.com/duckdb/pg_duckdb">pg_duckdb</a></strong> represented a major advancement, enabling DuckDB to serve as an embedded OLAP engine within PostgreSQL. The <strong><a href="https://motherduck.com/blog/pg-mooncake-columnstore/">pg_mooncake</a></strong> extension followed, providing native column store capabilities in open table formats like Iceberg and Delta. 
Crunchy Data and ParadeDB made similar contributions through <strong><a href="https://www.crunchydata.com/blog/pg_parquet-an-extension-to-connect-postgres-and-parquet">pg_parquet</a></strong> and <strong>pg_analytics</strong> respectively, enabling direct analytics over Parquet files on data lakes.</p><h3>Zero-Disk Architecture</h3><p>The <em><strong>zero-disk architecture</strong></em> emerged as perhaps the most transformative trend in storage systems, fundamentally changing how database systems manage storage and compute layers.</p><p>This architectural approach completely eliminates the need for locally attached disks, instead using remote deep storage solutions like S3 object storage as the primary persistence layer.</p><p>Beyond OLAP storage systems, such as cloud data warehouses and open table formats, this pattern is now emerging in NoSQL, real-time, streaming, and transactional systems as well.</p><p>The primary <strong>trade-off</strong> between disk-based and zero-disk systems is <strong>cost vs performance</strong>, specifically the <strong>I/O latency</strong> of reading and writing data to physical storage. 
While disk-based systems can deliver fast sub-millisecond I/O, zero-disk systems achieve economies of scale with cheap, scalable object storage, at the cost of latencies of up to one second when reading and writing data to an object storage service.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C4OG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f7c2fd-3e91-4e4f-836a-8f34c8e3a0fc_2138x1632.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C4OG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f7c2fd-3e91-4e4f-836a-8f34c8e3a0fc_2138x1632.png 424w, https://substackcdn.com/image/fetch/$s_!C4OG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f7c2fd-3e91-4e4f-836a-8f34c8e3a0fc_2138x1632.png 848w, https://substackcdn.com/image/fetch/$s_!C4OG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f7c2fd-3e91-4e4f-836a-8f34c8e3a0fc_2138x1632.png 1272w, https://substackcdn.com/image/fetch/$s_!C4OG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f7c2fd-3e91-4e4f-836a-8f34c8e3a0fc_2138x1632.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C4OG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f7c2fd-3e91-4e4f-836a-8f34c8e3a0fc_2138x1632.png" width="1456" height="1111" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93f7c2fd-3e91-4e4f-836a-8f34c8e3a0fc_2138x1632.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1111,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1685869,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C4OG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f7c2fd-3e91-4e4f-836a-8f34c8e3a0fc_2138x1632.png 424w, https://substackcdn.com/image/fetch/$s_!C4OG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f7c2fd-3e91-4e4f-836a-8f34c8e3a0fc_2138x1632.png 848w, https://substackcdn.com/image/fetch/$s_!C4OG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f7c2fd-3e91-4e4f-836a-8f34c8e3a0fc_2138x1632.png 1272w, https://substackcdn.com/image/fetch/$s_!C4OG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f7c2fd-3e91-4e4f-836a-8f34c8e3a0fc_2138x1632.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>New database systems including <strong><a href="https://slatedb.io/">SlateDB</a></strong> and <strong><a href="https://horaedb.apache.org/">Apache HoraeDB</a></strong> time-series database were built from the ground up with this architecture, while established systems like <strong>Apache Doris</strong> and <strong>StarRocks</strong> adopted it in 2024. 
Other real-time engines such as <strong>AutoMQ</strong> and <strong><a href="https://www.bigdatawire.com/2023/04/26/influxdata-revamps-influxdb-with-3-0-release-embraces-apache-arrow/">InfluxDB 3.0</a></strong> are increasingly adopting the zero-disk paradigm.</p><p>For a comprehensive analysis of zero-disk architecture and its implications, see the detailed exploration in the following article:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;672fe810-c40e-4032-8818-45f9c1beda65&quot;,&quot;caption&quot;:&quot;This is the third part in the Data Landscape Trends 2024-2025 series, focusing on the evolution of zero-disk architecture.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Zero-Disk Architecture: The Future of Cloud Storage Systems&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:3524999,&quot;name&quot;:&quot;Alireza Sadeghi&quot;,&quot;bio&quot;:&quot;I research, build and scale data platforms.\nMy primary publication is practicaldataengineering.substack.com&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f2c1bd7-7ad0-45b3-a325-7e369db6965b_576x576.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-01-16T10:33:31.367Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f49ca93-5e08-4399-9414-428d78fbd4e8_1662x1563.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.pracdata.io/p/zero-disk-architecture-the-future&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:154871061,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:19,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Practical Data 
Engineering&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46497e14-ec41-42a0-9067-72715fc9c842_848x848.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3>Other Notable Developments</h3><p>Following Redis's move to a proprietary license in 2024, <strong>Valkey</strong> emerged as a leading open source alternative, becoming the most-starred storage system on GitHub in 2024. Major cloud providers quickly adopted it, with Google integrating it into <strong>Memorystore</strong> and Amazon supporting it through <strong>ElastiCache</strong> and <strong>MemoryDB</strong> services.</p><p>Other notable developments include <strong>ParadeDB</strong>, an alternative to Elasticsearch built on the PostgreSQL engine, and new hybrid streaming storage systems like <strong>Proton</strong> from TimePlus and <strong><a href="https://www.ververica.com/blog/introducing-fluss">Fluss</a></strong> introduced by Ververica. These systems aim to integrate streaming and OLAP functionalities with a columnar storage foundation.</p><div><hr></div><h2>2. Data Lake Platform</h2><p>With database pioneer Michael Stonebraker <a href="https://www.bigdatawire.com/2024/07/08/dont-believe-the-big-database-hype-stonebraker-warns/">endorsing</a> the <strong>lakehouse architecture</strong> and open table formats as '<em><strong>the OLAP DBMS archetype for the next decade</strong></em>', data lakehouse continues to be the hottest topic in data engineering.</p><p>The <a href="https://www.pracdata.io/p/the-history-and-evolution-of-open">open table format landscape</a> continued to evolve significantly in 2024. 
The fourth major open table format, <strong>Apache Paimon</strong>, graduated from incubation, bringing <strong><a href="https://www.ververica.com/blog/apache-paimon-the-streaming-lakehouse">streaming lakehouse</a></strong> capabilities with Apache Flink integration. <strong>Apache XTable</strong> emerged as a new project focused on bi-directional format conversion, while <strong>Apache Amoro</strong> entered incubation with its lakehouse management framework.</p><p>In 2024, <strong>Apache Iceberg</strong> established itself as the leading project among open table format frameworks, distinguished by its ecosystem expansion and GitHub repository metrics, including a higher number of stars, forks, pull requests, and commits.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gaxs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bf3e0c7-ff96-4810-88ad-557a624647d0_1049x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gaxs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bf3e0c7-ff96-4810-88ad-557a624647d0_1049x700.png 424w, https://substackcdn.com/image/fetch/$s_!Gaxs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bf3e0c7-ff96-4810-88ad-557a624647d0_1049x700.png 848w, https://substackcdn.com/image/fetch/$s_!Gaxs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bf3e0c7-ff96-4810-88ad-557a624647d0_1049x700.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Gaxs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bf3e0c7-ff96-4810-88ad-557a624647d0_1049x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gaxs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bf3e0c7-ff96-4810-88ad-557a624647d0_1049x700.png" width="1049" height="700" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6bf3e0c7-ff96-4810-88ad-557a624647d0_1049x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:700,&quot;width&quot;:1049,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104444,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gaxs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bf3e0c7-ff96-4810-88ad-557a624647d0_1049x700.png 424w, https://substackcdn.com/image/fetch/$s_!Gaxs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bf3e0c7-ff96-4810-88ad-557a624647d0_1049x700.png 848w, https://substackcdn.com/image/fetch/$s_!Gaxs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bf3e0c7-ff96-4810-88ad-557a624647d0_1049x700.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Gaxs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bf3e0c7-ff96-4810-88ad-557a624647d0_1049x700.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B9NJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43594469-2a54-4c7f-bcec-a7af158ab824_2918x743.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B9NJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43594469-2a54-4c7f-bcec-a7af158ab824_2918x743.png 424w, https://substackcdn.com/image/fetch/$s_!B9NJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43594469-2a54-4c7f-bcec-a7af158ab824_2918x743.png 848w, https://substackcdn.com/image/fetch/$s_!B9NJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43594469-2a54-4c7f-bcec-a7af158ab824_2918x743.png 1272w, https://substackcdn.com/image/fetch/$s_!B9NJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43594469-2a54-4c7f-bcec-a7af158ab824_2918x743.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B9NJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43594469-2a54-4c7f-bcec-a7af158ab824_2918x743.png" width="1456" height="371" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43594469-2a54-4c7f-bcec-a7af158ab824_2918x743.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:371,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:333100,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!B9NJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43594469-2a54-4c7f-bcec-a7af158ab824_2918x743.png 424w, https://substackcdn.com/image/fetch/$s_!B9NJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43594469-2a54-4c7f-bcec-a7af158ab824_2918x743.png 848w, https://substackcdn.com/image/fetch/$s_!B9NJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43594469-2a54-4c7f-bcec-a7af158ab824_2918x743.png 1272w, https://substackcdn.com/image/fetch/$s_!B9NJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43594469-2a54-4c7f-bcec-a7af158ab824_2918x743.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>All major SaaS and cloud vendors have been enhancing their platforms to support access to open table formats. However, write support has been less prevalent, with <strong>Apache Iceberg</strong> being the preferred choice for comprehensive CRUD (Create, Read, Update, Delete) integration.</p><p>Several recent launches illustrate the increasing adoption of, and deep integration with, Iceberg across the ecosystem: Google's <strong>BigLake Managed Tables</strong>, which enable mutable Iceberg tables within customer-managed cloud storage; Amazon's newly announced <strong>S3 Tables</strong> with native Iceberg support; Redpanda's <strong><a href="https://www.redpanda.com/blog/apache-iceberg-topics-streaming-data">Iceberg Topics</a></strong>; and the <strong>Crunchy Data Warehouse</strong>, which is <a href="https://www.crunchydata.com/blog/crunchy-data-warehouse-postgres-with-iceberg-for-high-performance-analytics">deeply integrated</a> with Apache Iceberg.</p><p>Going forward, universal table formats like Apache XTable and <strong>Delta UniForm</strong> (Delta Lake Universal Format) may face significant challenges in navigating the potential divergence of features across formats, and the fate of open table formats may mirror that of open file formats, where Parquet emerged as the de facto standard. 
</p><p>As the lakehouse ecosystem continues to grow, the adoption of interoperable open standards and frameworks within an <strong>Open Data Lakehouse</strong> platform is expected to gain more popularity.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8vjL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed85a49-16eb-48cd-a36a-d3d53de82bba_1877x1307.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8vjL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed85a49-16eb-48cd-a36a-d3d53de82bba_1877x1307.png 424w, https://substackcdn.com/image/fetch/$s_!8vjL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed85a49-16eb-48cd-a36a-d3d53de82bba_1877x1307.png 848w, https://substackcdn.com/image/fetch/$s_!8vjL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed85a49-16eb-48cd-a36a-d3d53de82bba_1877x1307.png 1272w, https://substackcdn.com/image/fetch/$s_!8vjL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed85a49-16eb-48cd-a36a-d3d53de82bba_1877x1307.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8vjL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed85a49-16eb-48cd-a36a-d3d53de82bba_1877x1307.png" width="1456" height="1014" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fed85a49-16eb-48cd-a36a-d3d53de82bba_1877x1307.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1014,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2138696,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8vjL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed85a49-16eb-48cd-a36a-d3d53de82bba_1877x1307.png 424w, https://substackcdn.com/image/fetch/$s_!8vjL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed85a49-16eb-48cd-a36a-d3d53de82bba_1877x1307.png 848w, https://substackcdn.com/image/fetch/$s_!8vjL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed85a49-16eb-48cd-a36a-d3d53de82bba_1877x1307.png 1272w, https://substackcdn.com/image/fetch/$s_!8vjL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed85a49-16eb-48cd-a36a-d3d53de82bba_1877x1307.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Emergence of Native Table Format Libraries</h3><p>A new trend in the lakehouse ecosystem is the development of native libraries in <strong>Python</strong> and <strong>Rust</strong>. These libraries aim to provide direct access to open table formats without the need for heavy frameworks like Spark.</p><p>Notable examples include <strong><a href="https://github.com/delta-io/delta-rs">Delta-rs</a></strong>, a native Rust library for Delta Lake with Python bindings; <strong><a href="https://github.com/apache/hudi-rs">Hudi-rs</a></strong>, a Rust implementation for Apache Hudi with a Python API; and <strong><a href="https://github.com/apache/iceberg-python">PyIceberg</a></strong>, an evolving Python library designed to make the Iceberg table format accessible outside the default Spark engine.</p><div><hr></div><h2>3. 
Data Processing &amp; Integration</h2><h3>Rise of Single-Node Processing</h3><p>The rise of <strong>single-node processing</strong> represents a fundamental shift in data processing, challenging traditional distributed-first approaches.</p><p>Recent analyses show that many companies have overestimated their <em><strong><a href="https://motherduck.com/blog/big-data-is-dead/">big data</a></strong></em> needs, prompting a reassessment of their data processing requirements. Even in organisations with large data volumes, approximately <strong><a href="https://motherduck.com/blog/redshift-files-hunt-for-big-data/">90% of queries</a></strong> are small enough to run on a single machine, since they scan only recent data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0GDf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9335f59a-cf46-410d-81e1-938539f0f951_1200x812.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0GDf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9335f59a-cf46-410d-81e1-938539f0f951_1200x812.png 424w, https://substackcdn.com/image/fetch/$s_!0GDf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9335f59a-cf46-410d-81e1-938539f0f951_1200x812.png 848w, https://substackcdn.com/image/fetch/$s_!0GDf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9335f59a-cf46-410d-81e1-938539f0f951_1200x812.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0GDf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9335f59a-cf46-410d-81e1-938539f0f951_1200x812.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0GDf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9335f59a-cf46-410d-81e1-938539f0f951_1200x812.png" width="1200" height="812" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9335f59a-cf46-410d-81e1-938539f0f951_1200x812.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:464245,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0GDf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9335f59a-cf46-410d-81e1-938539f0f951_1200x812.png 424w, https://substackcdn.com/image/fetch/$s_!0GDf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9335f59a-cf46-410d-81e1-938539f0f951_1200x812.png 848w, https://substackcdn.com/image/fetch/$s_!0GDf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9335f59a-cf46-410d-81e1-938539f0f951_1200x812.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0GDf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9335f59a-cf46-410d-81e1-938539f0f951_1200x812.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Modern single-node processing engines, such as DuckDB, <strong>Apache DataFusion</strong>, and <strong>Polars</strong>, have emerged as powerful alternatives, capable of handling workloads that previously necessitated distributed systems like Hive/Tez, Spark, Presto or Amazon Athena.</p><p>To explore the comprehensive analysis on the state of single-node 
processing, please follow the link below:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;1b797bcd-7e9a-4953-80ff-33944ce0752b&quot;,&quot;caption&quot;:&quot;This is part two of Data Landscape Trends 2024-2025 series, focusing on single-node processing trends.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Rise of Single-Node Processing: Challenging the Distributed-First Mindset&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:3524999,&quot;name&quot;:&quot;Alireza Sadeghi&quot;,&quot;bio&quot;:&quot;I research, build and scale data platforms.\nMy primary publication is practicaldataengineering.substack.com&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f2c1bd7-7ad0-45b3-a325-7e369db6965b_576x576.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-01-06T11:08:03.980Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.pracdata.io/p/the-rise-of-single-node-processing&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:154253264,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:46,&quot;comment_count&quot;:1,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Practical Data 
Engineering&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46497e14-ec41-42a0-9067-72715fc9c842_848x848.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3>Stream Processing</h3><p>The stream processing ecosystem continued to expand in 2024, with <strong>Apache Flink</strong> further solidifying its position as the premier streaming engine, while <strong>Apache Spark</strong> retains its strong position.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xfQ0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F392eeec9-2c17-413a-bad5-064d83cc81cc_1394x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xfQ0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F392eeec9-2c17-413a-bad5-064d83cc81cc_1394x742.png 424w, https://substackcdn.com/image/fetch/$s_!xfQ0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F392eeec9-2c17-413a-bad5-064d83cc81cc_1394x742.png 848w, https://substackcdn.com/image/fetch/$s_!xfQ0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F392eeec9-2c17-413a-bad5-064d83cc81cc_1394x742.png 1272w, https://substackcdn.com/image/fetch/$s_!xfQ0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F392eeec9-2c17-413a-bad5-064d83cc81cc_1394x742.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xfQ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F392eeec9-2c17-413a-bad5-064d83cc81cc_1394x742.png" width="1394" height="742" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/392eeec9-2c17-413a-bad5-064d83cc81cc_1394x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1394,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94312,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xfQ0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F392eeec9-2c17-413a-bad5-064d83cc81cc_1394x742.png 424w, https://substackcdn.com/image/fetch/$s_!xfQ0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F392eeec9-2c17-413a-bad5-064d83cc81cc_1394x742.png 848w, https://substackcdn.com/image/fetch/$s_!xfQ0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F392eeec9-2c17-413a-bad5-064d83cc81cc_1394x742.png 1272w, https://substackcdn.com/image/fetch/$s_!xfQ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F392eeec9-2c17-413a-bad5-064d83cc81cc_1394x742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Celebrating its 10th anniversary, Flink released version 2.0, the first major update since Flink 1.0 debuted eight years ago. The Apache Flink ecosystem expanded significantly with the introduction of the Apache Paimon open table format and the newly open-sourced <strong>Fluss</strong> streaming engine. 
In 2024, leading cloud vendors increasingly integrated Flink into their managed services, the latest being Google&#8217;s serverless <strong><a href="https://cloud.google.com/blog/products/data-analytics/introducing-bigquery-engine-for-apache-flink/">BigQuery Engine for Apache Flink</a></strong> solution.</p><p>Emerging streaming engines such as <strong><a href="https://github.com/infinyon/fluvio">Fluvio</a></strong>, <strong><a href="https://github.com/ArroyoSystems/arroyo">Arroyo</a></strong> and <strong><a href="https://github.com/airtai/faststream">FastStream</a></strong> are striving to compete with these established contenders. <strong>Fluvio</strong> and <strong>Arroyo</strong> stand out as the only <strong><a href="https://xuanwo.io/2024/07-rewrite-bigdata-in-rust/">Rust-based engines</a></strong>, aiming to eliminate the overhead typically associated with traditional JVM-based stream processing engines.</p><p>In major open source streaming news, <strong>Redpanda</strong> acquired <strong>Benthos.dev</strong>, rebranding it as <strong>Redpanda Connect</strong> and transitioning it to a more proprietary license. In response, WarpStream <strong><a href="https://www.warpstream.com/blog/announcing-bento-the-open-source-fork-of-the-project-formerly-known-as-benthos">forked</a></strong> the Benthos project, renaming it <strong>Bento</strong> and committing to keeping it 100% MIT-licensed.</p><h3>Python Processing Frameworks</h3><p>In the Python data processing ecosystem, <strong>Polars</strong> is currently the dominant <strong>high-performance DataFrame</strong> library for data engineering workloads (excluding PySpark). Polars recorded <strong>89 million downloads</strong> in 2024 and reached a significant milestone with its 1.0 release. 
</p><p>However, Polars now faces competition from DuckDB's DataFrame API, which has captured the community's attention with its remarkably simple integration with external storage systems and its <em><strong>zero-copy</strong></em> integration (direct memory sharing between different systems) with <strong>Apache Arrow</strong>, similar to Polars. Both libraries ranked in the <strong>top 1%</strong> of the most downloaded Python libraries last year.</p><p><strong>Apache Arrow</strong> has solidified its position as the de facto standard for in-memory data representation in the Python data processing ecosystem. The framework has established deep integration with various Python processing frameworks, including Apache DataFusion, Ibis, Daft, cuDF, and Pandas 3.0.</p><p><strong>Ibis</strong> and <strong>Daft</strong> are other innovative DataFrame projects with high potential. Ibis features a seamless back-end interface to various SQL-based databases, while Daft is built from the ground up to support distributed DataFrame processing.</p><div><hr></div><h2>4. 
Workflow Orchestration &amp; DataOps</h2><p>In 2025, the open source workflow orchestration category continues to stand as one of the most dynamic segments of the data engineering ecosystem, featuring <strong>over 10 active projects</strong> that range from established platforms like <strong>Apache Airflow</strong> to newly open-sourced engines like Netflix's <strong><a href="https://netflixtechblog.com/maestro-netflixs-workflow-orchestrator-ee13a06f9c78">Maestro</a></strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5NsV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56574129-28a9-4d27-916e-b188d9d7f2c4_2400x1125.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5NsV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56574129-28a9-4d27-916e-b188d9d7f2c4_2400x1125.png 424w, https://substackcdn.com/image/fetch/$s_!5NsV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56574129-28a9-4d27-916e-b188d9d7f2c4_2400x1125.png 848w, https://substackcdn.com/image/fetch/$s_!5NsV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56574129-28a9-4d27-916e-b188d9d7f2c4_2400x1125.png 1272w, https://substackcdn.com/image/fetch/$s_!5NsV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56574129-28a9-4d27-916e-b188d9d7f2c4_2400x1125.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!5NsV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56574129-28a9-4d27-916e-b188d9d7f2c4_2400x1125.png" width="1456" height="683" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56574129-28a9-4d27-916e-b188d9d7f2c4_2400x1125.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:683,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:437834,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5NsV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56574129-28a9-4d27-916e-b188d9d7f2c4_2400x1125.png 424w, https://substackcdn.com/image/fetch/$s_!5NsV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56574129-28a9-4d27-916e-b188d9d7f2c4_2400x1125.png 848w, https://substackcdn.com/image/fetch/$s_!5NsV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56574129-28a9-4d27-916e-b188d9d7f2c4_2400x1125.png 1272w, https://substackcdn.com/image/fetch/$s_!5NsV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56574129-28a9-4d27-916e-b188d9d7f2c4_2400x1125.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>After a decade, Apache Airflow remains the most widely deployed and adopted workflow orchestration engine, with a staggering <strong>320M downloads</strong> in 2024 alone, while facing competition from rising challengers such as <strong>Dagster</strong>, <strong>Prefect</strong> and <strong>Kestra</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CtLm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06f4930-4a1a-489b-9224-55b4ff349a67_1260x874.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!CtLm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06f4930-4a1a-489b-9224-55b4ff349a67_1260x874.png 424w, https://substackcdn.com/image/fetch/$s_!CtLm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06f4930-4a1a-489b-9224-55b4ff349a67_1260x874.png 848w, https://substackcdn.com/image/fetch/$s_!CtLm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06f4930-4a1a-489b-9224-55b4ff349a67_1260x874.png 1272w, https://substackcdn.com/image/fetch/$s_!CtLm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06f4930-4a1a-489b-9224-55b4ff349a67_1260x874.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CtLm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06f4930-4a1a-489b-9224-55b4ff349a67_1260x874.png" width="1260" height="874" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b06f4930-4a1a-489b-9224-55b4ff349a67_1260x874.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:874,&quot;width&quot;:1260,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:124788,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!CtLm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06f4930-4a1a-489b-9224-55b4ff349a67_1260x874.png 424w, https://substackcdn.com/image/fetch/$s_!CtLm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06f4930-4a1a-489b-9224-55b4ff349a67_1260x874.png 848w, https://substackcdn.com/image/fetch/$s_!CtLm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06f4930-4a1a-489b-9224-55b4ff349a67_1260x874.png 1272w, https://substackcdn.com/image/fetch/$s_!CtLm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06f4930-4a1a-489b-9224-55b4ff349a67_1260x874.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Interestingly, <strong>Kestra</strong> gained the most stars on GitHub in 2024, with a surge directly linked to its $8M funding announcement in September, which was featured on <a href="https://techcrunch.com/2024/09/23/kestra-raises-another-8-million-for-its-open-source-orchestration-platform/">TechCrunch</a>. In terms of code activity, <strong>Dagster</strong> stood out with 27K commits and close to 6K pull requests closed in 2024.</p><p>For a comprehensive analysis of the state of workflow orchestration systems, read the following article:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;cc597d94-9a5a-4e90-91d0-66708b24f2da&quot;,&quot;caption&quot;:&quot;This is the fifth part in the Data Landscape Trends 2024-2025 series, focusing on the state of the open source workflow orchestration systems.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;State of Open Source Workflow Orchestration Systems 2025&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:3524999,&quot;name&quot;:&quot;Alireza Sadeghi&quot;,&quot;bio&quot;:&quot;I research, build and scale data platforms.\nMy primary publication is 
practicaldataengineering.substack.com&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f2c1bd7-7ad0-45b3-a325-7e369db6965b_576x576.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-02-02T20:40:32.477Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.pracdata.io/p/state-of-workflow-orchestration-ecosystem-2025&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:156293319,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:14,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Practical Data Engineering&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46497e14-ec41-42a0-9067-72715fc9c842_848x848.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3>Data Quality</h3><p><strong>Great Expectations</strong> continues to be a leading Python framework for data quality and validation, and was also featured in Databricks' <a href="https://www.google.com/url?sa=t&amp;source=web&amp;rct=j&amp;opi=89978449&amp;url=https://www.databricks.com/resources/ebook/state-of-data-ai&amp;ved=2ahUKEwjjyu6m-q6LAxX4JUQIHT0nFkMQFnoECBIQAQ&amp;usg=AOvVaw1kV7-XyUDOl5oHepsMrDI6">Top 10 Data and AI products of 2024</a>. It is followed closely by <strong>Soda</strong> and <strong>Pandera</strong> in data engineering practice. 
However, there is some disappointing news: the <strong>Data-Diff</strong> project was archived in 2024 by its main maintainer, Datafold.</p><h3>Data Versioning</h3><p>Data versioning remained a prominent topic in 2024, as efforts continue to bring the capabilities of modern version control systems, like Git, to data lakes and lakehouses.</p><p>Projects like <strong><a href="https://github.com/treeverse/lakeFS">LakeFS</a></strong> and <strong><a href="https://github.com/projectnessie/nessie">Nessie</a></strong> enhance modern data lakes and open table formats such as Iceberg and Delta Lake by extending their transactional metadata layers.</p><h3>Data Transformation</h3><p>The use of <strong>dbt</strong> for data transformation is expanding beyond its original focus on data modeling within data warehouse systems. It is now making inroads into <strong>off-warehouse</strong> environments, such as data lakes, through new integrations and plugins that leverage ephemeral compute engines like <strong><a href="https://github.com/starburstdata/dbt-trino">Trino</a></strong>.</p><p>Currently, dbt faces competition primarily from <strong>SQLMesh</strong>. A notable stand-off in 2024 was the SQLMesh vs. dbt debate, highlighted by <strong>Tobiko's CEO</strong>, who claimed on social media that <em>SQLMesh is so good it's banned from dbt's Coalesce conference!</em></p><div><hr></div><h2>5. Data Integration</h2><p>In the data integration space, <strong>Airbyte</strong> maintained its leadership position, closing 13K pull requests in preparation for version 1.x. The <strong>dlt</strong> framework demonstrated significant maturation with its 1.0 release, while <strong>Apache SeaTunnel</strong> gained traction as a compelling alternative.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3x4d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477b0bf1-0351-4fcb-a749-c00e597a24a1_2369x1250.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3x4d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477b0bf1-0351-4fcb-a749-c00e597a24a1_2369x1250.png 424w, https://substackcdn.com/image/fetch/$s_!3x4d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477b0bf1-0351-4fcb-a749-c00e597a24a1_2369x1250.png 848w, https://substackcdn.com/image/fetch/$s_!3x4d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477b0bf1-0351-4fcb-a749-c00e597a24a1_2369x1250.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3x4d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477b0bf1-0351-4fcb-a749-c00e597a24a1_2369x1250.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3x4d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477b0bf1-0351-4fcb-a749-c00e597a24a1_2369x1250.png" width="1456" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/477b0bf1-0351-4fcb-a749-c00e597a24a1_2369x1250.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245231,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3x4d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477b0bf1-0351-4fcb-a749-c00e597a24a1_2369x1250.png 424w, https://substackcdn.com/image/fetch/$s_!3x4d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477b0bf1-0351-4fcb-a749-c00e597a24a1_2369x1250.png 848w, https://substackcdn.com/image/fetch/$s_!3x4d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477b0bf1-0351-4fcb-a749-c00e597a24a1_2369x1250.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3x4d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477b0bf1-0351-4fcb-a749-c00e597a24a1_2369x1250.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The <strong>Change Data Capture (CDC)</strong> framework landscape evolved with new tools including <strong><a href="https://github.com/artie-labs/transfer">Artie Transfer</a></strong> and <strong><a href="https://github.com/PeerDB-io/peerdb">PeerDB</a></strong> (acquired by ClickHouse), while <strong><a 
href="https://github.com/apache/flink-cdc">Flink CDC</a></strong> connectors gained adoption among platforms using Flink as their primary streaming engine.</p><h3>Event Hubs (Streaming Pub/Sub Services)</h3><p>One of the most notable innovations in the data integration space in 2024 came from the evolving data streaming landscape. A significant architectural shift in this category is the <em><strong>separation of storage and compute</strong></em>, coupled with the adoption of object storage in a zero-disk architecture. <strong>WarpStream</strong> pioneered this architecture in the real-time streaming space.</p><p>This model also enables a flexible <strong>Bring Your Own Cloud (BYOC)</strong> <a href="https://www.kai-waehner.de/blog/2024/09/12/deployment-options-for-apache-kafka-self-managed-fully-managed-serverless-and-byoc-bring-your-own-cloud/">deployment strategy</a>, as both compute and storage can be hosted on the customer's preferred infrastructure, while the service provider maintains the control plane.</p><p>WarpStream's success has prompted major competitors to adopt similar architectures. Redpanda launched <strong><a href="https://www.redpanda.com/blog/cloud-topics-streaming-data-object-storage">Cloud Topics</a></strong>, enhancing its offerings, while <strong><a href="https://www.automq.com/blog/introducing-automq-cloud-native-replacement-of-apache-kafka">AutoMQ</a></strong> implemented a hybrid approach featuring a fast caching layer to improve I/O performance.</p><p>Additionally, StreamNative introduced the <strong><a href="https://streamnative.io/products/ursa">Ursa</a></strong> engine for Apache Pulsar, and Confluent unveiled its own cloud-native <strong><a href="https://www.confluent.io/blog/introducing-confluent-cloud-freight-clusters/">Freight Clusters</a></strong> in 2024. Ultimately, Confluent decided to acquire WarpStream, further expanding its offering with a BYOC model. 
Meanwhile, the remarkable <strong>Apache Kafka</strong> stands at a <strong><a href="https://blog.det.life/kafka-has-reached-a-turning-point-649bd18b967f">crossroads</a></strong> that may define its future direction in the ecosystem.</p><div><hr></div><h2>6. Data Infrastructure</h2><p>The data infrastructure landscape in 2024 remained largely stable, with <strong>Kubernetes</strong> celebrating its 10th anniversary while maintaining its position as the leading resource scheduling and virtualisation engine in cloud environments. </p><p>In the observability space, <strong>InfluxDB</strong>, <strong>Prometheus</strong>, and <strong>Grafana</strong> continued their dominance, with <strong>Grafana Labs</strong> securing a notable <a href="https://www.bigdatawire.com/2024/08/26/grafana-labs-raises-270m-boosting-valuation-to-over-6b/">$270M funding round</a> that reinforced the long-term viability of their core products like Grafana as general-purpose observability solutions.</p><div><hr></div><h2>7. ML/AI Platform</h2><p>Vector databases maintained strong momentum from 2023, with <strong>Milvus</strong> emerging as a leader alongside <strong>Qdrant</strong>, <strong>Chroma</strong>, and <strong>Weaviate</strong>. The category now encompasses ten active vector database projects, reflecting the growing importance of vector search capabilities in modern AI-enabled data architectures. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U2wg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb976acaf-a6eb-47af-9533-1a0ea552188e_2326x1257.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U2wg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb976acaf-a6eb-47af-9533-1a0ea552188e_2326x1257.png 424w, https://substackcdn.com/image/fetch/$s_!U2wg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb976acaf-a6eb-47af-9533-1a0ea552188e_2326x1257.png 848w, https://substackcdn.com/image/fetch/$s_!U2wg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb976acaf-a6eb-47af-9533-1a0ea552188e_2326x1257.png 1272w, https://substackcdn.com/image/fetch/$s_!U2wg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb976acaf-a6eb-47af-9533-1a0ea552188e_2326x1257.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U2wg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb976acaf-a6eb-47af-9533-1a0ea552188e_2326x1257.png" width="1456" height="787" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b976acaf-a6eb-47af-9533-1a0ea552188e_2326x1257.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:787,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231173,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U2wg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb976acaf-a6eb-47af-9533-1a0ea552188e_2326x1257.png 424w, https://substackcdn.com/image/fetch/$s_!U2wg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb976acaf-a6eb-47af-9533-1a0ea552188e_2326x1257.png 848w, https://substackcdn.com/image/fetch/$s_!U2wg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb976acaf-a6eb-47af-9533-1a0ea552188e_2326x1257.png 1272w, https://substackcdn.com/image/fetch/$s_!U2wg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb976acaf-a6eb-47af-9533-1a0ea552188e_2326x1257.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The introduction of <strong>LLMOps</strong> (also referred to as <strong><a href="https://cloud.google.com/blog/products/ai-machine-learning/learn-how-to-build-and-scale-generative-ai-solutions-with-genops/">GenOps</a></strong>) as a distinct category in this year's landscape was marked by the rapid growth of new projects like <strong><a href="https://dify.ai/">Dify</a></strong> and <strong><a href="https://github.com/vllm-project/vllm">vLLM</a></strong>, purpose-built for managing LLMs.</p><div><hr></div><h2>8. Metadata Management</h2><p>Metadata management platforms have gained significant momentum in recent years, with <strong>DataHub</strong> leading the open source space through its active development and community engagement.</p><p>However, the most notable developments in 2024 occurred in catalog management. 
While 2023 was dominated by competition in open table formats, 2024 marked the beginning of the <strong>Catalog War</strong>.</p><p>In contrast to earlier years, 2024 brought a wave of new open catalog solutions to the market, including <strong>Polaris</strong> (open sourced by Snowflake), <strong>Unity Catalog</strong> (open sourced by Databricks), <strong>LakeKeeper</strong>, and <strong>Apache Gravitino</strong>. </p><p>This proliferation reflects the realisation that emerging data lakehouse platforms, which rely heavily on open table formats, lack advanced built-in catalog management capabilities for seamless multi-engine interoperability.</p><p>All of these projects have the potential to establish a new standard for vendor-agnostic, open catalog services in data lakehouse platforms. Much like <strong>Hive Metastore</strong> became the de facto standard for Hadoop-based platforms, these emerging catalogs may finally replace Hive Metastore's long-standing dominance in catalog management on open data platforms.</p><div><hr></div><h2>9. Analytics &amp; Visualisation</h2><p>In the open source Business Intelligence realm, <strong>Apache Superset</strong> and <strong>Metabase</strong> remain the leading BI solutions. While Superset leads in GitHub popularity, Metabase shows the highest development activity. 
<strong><a href="https://github.com/lightdash/lightdash">Lightdash</a></strong> emerged as a promising newcomer, securing $11 million in funding and demonstrating market demand for lightweight BI solutions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TtiP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d29dd8-4b91-4844-bb92-dc90bbc6f06e_2571x1244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TtiP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d29dd8-4b91-4844-bb92-dc90bbc6f06e_2571x1244.png 424w, https://substackcdn.com/image/fetch/$s_!TtiP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d29dd8-4b91-4844-bb92-dc90bbc6f06e_2571x1244.png 848w, https://substackcdn.com/image/fetch/$s_!TtiP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d29dd8-4b91-4844-bb92-dc90bbc6f06e_2571x1244.png 1272w, https://substackcdn.com/image/fetch/$s_!TtiP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d29dd8-4b91-4844-bb92-dc90bbc6f06e_2571x1244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TtiP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d29dd8-4b91-4844-bb92-dc90bbc6f06e_2571x1244.png" width="1456" height="704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68d29dd8-4b91-4844-bb92-dc90bbc6f06e_2571x1244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:259185,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TtiP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d29dd8-4b91-4844-bb92-dc90bbc6f06e_2571x1244.png 424w, https://substackcdn.com/image/fetch/$s_!TtiP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d29dd8-4b91-4844-bb92-dc90bbc6f06e_2571x1244.png 848w, https://substackcdn.com/image/fetch/$s_!TtiP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d29dd8-4b91-4844-bb92-dc90bbc6f06e_2571x1244.png 1272w, https://substackcdn.com/image/fetch/$s_!TtiP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d29dd8-4b91-4844-bb92-dc90bbc6f06e_2571x1244.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>BI-as-Code Solutions</h3><p><strong>BI-as-Code</strong> emerged as a distinctive category, driven by the continued success of <strong>Streamlit</strong>, which maintained its position as the most popular BI-as-Code solution. 
</p><p>These tools enable developers to create interactive apps and lightweight BI dashboards using code, SQL and templates like Markdown or YAML, bringing software engineering best practices such as version control, testing and CI/CD into the dashboard development workflow.</p><p>In addition to Streamlit and the well-known <strong>Evidence</strong>, new entrants like <strong>Quary</strong> and <strong>Vizro</strong> have gained traction, with Quary notably implementing a Rust-based approach that diverges from the Python-centric norm of the category.</p><h3>Composable BI Stack</h3><p>The evolution of <em><strong>system decomposition</strong></em> is not limited to storage systems; it has also impacted Business Intelligence (BI) stacks. A new trend is emerging that combines lightweight, <em><strong>bottomless</strong></em> BI tools (which don't have a back-end server) with <em><strong>headless</strong></em> embeddable OLAP solutions such as Apache DataFusion, Apache Arrow, and DuckDB. 
</p><p>This integration addresses several gaps in the open source BI stack, such as the native ability to query external data lakes and lakehouses, while preserving the benefits of lightweight, disaggregated architectures.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W_Vo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9792b724-6b27-4c31-9254-2a1765007842_1612x1449.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W_Vo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9792b724-6b27-4c31-9254-2a1765007842_1612x1449.png 424w, https://substackcdn.com/image/fetch/$s_!W_Vo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9792b724-6b27-4c31-9254-2a1765007842_1612x1449.png 848w, https://substackcdn.com/image/fetch/$s_!W_Vo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9792b724-6b27-4c31-9254-2a1765007842_1612x1449.png 1272w, https://substackcdn.com/image/fetch/$s_!W_Vo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9792b724-6b27-4c31-9254-2a1765007842_1612x1449.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W_Vo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9792b724-6b27-4c31-9254-2a1765007842_1612x1449.png" width="1456" height="1309" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9792b724-6b27-4c31-9254-2a1765007842_1612x1449.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1309,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1108028,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W_Vo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9792b724-6b27-4c31-9254-2a1765007842_1612x1449.png 424w, https://substackcdn.com/image/fetch/$s_!W_Vo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9792b724-6b27-4c31-9254-2a1765007842_1612x1449.png 848w, https://substackcdn.com/image/fetch/$s_!W_Vo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9792b724-6b27-4c31-9254-2a1765007842_1612x1449.png 1272w, https://substackcdn.com/image/fetch/$s_!W_Vo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9792b724-6b27-4c31-9254-2a1765007842_1612x1449.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>BI products like <strong><a href="https://www.exploreomni.com/blog/DuckDB-complements-BI">Omni</a></strong>, <strong><a href="https://www.gooddata.com/blog/analytics-stack-with-apache-arrow/">GoodData</a></strong>, <strong><a href="https://evidence.dev/blog/why-we-built-usql">Evidence</a></strong>, and <strong><a href="https://www.rilldata.com/blog/why-we-built-rill-with-duckdb">Rilldata</a></strong> have already incorporated these engines into their BI and data exploration tools. 
Both Apache Superset (using the <a href="https://pypi.org/project/duckdb-engine/">duckdb-engine</a> library) and Metabase now support embedded DuckDB connections.</p><p>For a comprehensive analysis of the evolving <strong>composable BI architecture</strong>, see the detailed exploration in the following article:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;59c084b6-a742-4837-b2eb-35f4872e9f7d&quot;,&quot;caption&quot;:&quot;As we dive into 2025, the data engineering field continues its dramatic evolution. In this series, we'll explore the transformative trends reshaping the data engineering landscape, from emerging architectural patterns to new tooling approaches.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Evolution of Business Intelligence: From Monolithic to Composable Architecture&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:3524999,&quot;name&quot;:&quot;Alireza Sadeghi&quot;,&quot;bio&quot;:&quot;I research, build and scale data platforms.\nMy primary publication is practicaldataengineering.substack.com&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f2c1bd7-7ad0-45b3-a325-7e369db6965b_576x576.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-12-18T13:06:41.172Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d3e7046-6c4c-49f3-b45b-a339b6945f4b_403x364.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.pracdata.io/p/the-evolution-of-business-intelligence-stack&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:153307739,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:25,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Practical Data 
Engineering&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46497e14-ec41-42a0-9067-72715fc9c842_848x848.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3>MPP Query Engines</h3><p>In the post-Hadoop era, few new open source MPP (Massively Parallel Processing) systems have been introduced and innovation has been limited, while existing engines continue to mature. </p><p>While Hive's share is shrinking, <strong>Presto</strong> and <strong>Trino</strong> remain the top open source MPP query engines used in production, despite fierce competition from Spark as a unified engine and from managed cloud MPP products such as Databricks, Snowflake, AWS Redshift Spectrum and Athena.</p><div><hr></div><h1>Future Outlook and Conclusion</h1><p>The open source data ecosystem is entering a phase of maturity in key areas such as the data lakehouse, characterised by consolidation around proven technologies and increased focus on operational efficiency. </p><p>The landscape continues to evolve toward cloud-native, composable architectures while standardising around dominant technologies. 
Key areas to watch include:</p><ul><li><p>Further consolidation in the open table format space</p></li><li><p>Continued evolution of zero-disk architectures in real-time and transactional systems</p></li><li><p>Quest toward providing a unified lakehouse experience</p></li><li><p>The rise of LLMOps and AI Engineering</p></li><li><p>The expansion of the data lakehouse ecosystem in areas such as open catalog integration and development of native libraries</p></li><li><p>The increasing traction of single-node data processing and embedded analytics</p><p></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DS7n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" width="220" height="218.42293906810036" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:558,&quot;resizeWidth&quot;:220,&quot;bytes&quot;:136646,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/in/alirezasadeghi/&quot;,&quot;text&quot;:&quot;Follow Me on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/in/alirezasadeghi/"><span>Follow Me on LinkedIn</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[State of Open Source Workflow Orchestration Systems 2025]]></title><description><![CDATA[Overview of Major 2024 Trends and Emerging Technologies Shaping 2025]]></description><link>https://www.pracdata.io/p/state-of-workflow-orchestration-ecosystem-2025</link><guid isPermaLink="false">https://www.pracdata.io/p/state-of-workflow-orchestration-ecosystem-2025</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 02 Feb 2025 20:40:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uBXk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d365c5-90aa-4c05-b1c5-2d6ab1f322f1_2004x1417.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uBXk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d365c5-90aa-4c05-b1c5-2d6ab1f322f1_2004x1417.png 424w, 
https://substackcdn.com/image/fetch/$s_!uBXk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d365c5-90aa-4c05-b1c5-2d6ab1f322f1_2004x1417.png 848w, https://substackcdn.com/image/fetch/$s_!uBXk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d365c5-90aa-4c05-b1c5-2d6ab1f322f1_2004x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!uBXk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d365c5-90aa-4c05-b1c5-2d6ab1f322f1_2004x1417.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uBXk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d365c5-90aa-4c05-b1c5-2d6ab1f322f1_2004x1417.png" width="1456" height="1030" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49d365c5-90aa-4c05-b1c5-2d6ab1f322f1_2004x1417.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1030,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:808858,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uBXk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d365c5-90aa-4c05-b1c5-2d6ab1f322f1_2004x1417.png 424w, 
https://substackcdn.com/image/fetch/$s_!uBXk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d365c5-90aa-4c05-b1c5-2d6ab1f322f1_2004x1417.png 848w, https://substackcdn.com/image/fetch/$s_!uBXk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d365c5-90aa-4c05-b1c5-2d6ab1f322f1_2004x1417.png 1272w, https://substackcdn.com/image/fetch/$s_!uBXk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d365c5-90aa-4c05-b1c5-2d6ab1f322f1_2004x1417.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is the fifth part in the <strong>Data Landscape Trends 2024-2025</strong> series, focusing on the state of the open source workflow orchestration systems.</p><h1>Introduction</h1><p>In the rapidly evolving landscape of data engineering, workflow orchestration engines play a key role in managing complex data processes.</p><p>This analysis explores the current state of workflow orchestration engines through multiple lenses such as community engagement, technical architecture, adoption metrics, and emerging innovations in 2024.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cJAS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cJAS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png 424w, https://substackcdn.com/image/fetch/$s_!cJAS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png 848w, https://substackcdn.com/image/fetch/$s_!cJAS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png 1272w, https://substackcdn.com/image/fetch/$s_!cJAS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cJAS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png" width="1456" height="1689" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1689,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7314446,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cJAS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png 424w, https://substackcdn.com/image/fetch/$s_!cJAS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png 848w, https://substackcdn.com/image/fetch/$s_!cJAS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png 1272w, https://substackcdn.com/image/fetch/$s_!cJAS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F032c67aa-cc5b-4a78-8746-e33cae83fbb9_2721x3156.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We'll cover the following topics:</p><ul><li><p>Current open source workflow orchestration landscape</p></li><li><p>Open Source vs Open Core engines</p></li><li><p>Task-centric vs Data-centric engines</p></li><li><p>GitHub repository trends in 2024</p></li><li><p>Summary and analysis of the current state of the products</p></li><li><p>Major 2024 workflow orchestration trends</p></li><li><p>Recommendations and conclusion</p></li></ul><h1>Current OSS Landscape and Major Products</h1><p>The workflow orchestration category stands out as one of the most dynamic segments of the open source data engineering ecosystem. 
</p><p>It features over 10 active projects that range from established products like <strong>Apache Airflow</strong> to newly open sourced engines like <strong>Netflix's Maestro</strong>.</p><p>The evolution of major open source workflow orchestration engines traces back to 2008, when <strong>Yahoo</strong> developed the first significant workflow engine, <strong>Oozie</strong>, to address the growing complexity of managing workloads on the Hadoop platform. </p><p>Since then, the industry has developed numerous orchestration systems to meet the growing demands of workload management and orchestration on data platforms.</p><p>Some projects, such as <strong>Orchest</strong>, have come and gone and are no longer maintained. Such retired projects are excluded from this analysis.</p><p>The timeline below illustrates the development progression of major open source workflow orchestration engines, highlighting both their initial open source releases and, where applicable, their subsequent donation to an open source foundation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q1Sz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3884b3a-ae6f-46a2-b0fa-1e97396b0a77_3437x2426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q1Sz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3884b3a-ae6f-46a2-b0fa-1e97396b0a77_3437x2426.png 424w, https://substackcdn.com/image/fetch/$s_!q1Sz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3884b3a-ae6f-46a2-b0fa-1e97396b0a77_3437x2426.png 848w, 
https://substackcdn.com/image/fetch/$s_!q1Sz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3884b3a-ae6f-46a2-b0fa-1e97396b0a77_3437x2426.png 1272w, https://substackcdn.com/image/fetch/$s_!q1Sz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3884b3a-ae6f-46a2-b0fa-1e97396b0a77_3437x2426.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q1Sz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3884b3a-ae6f-46a2-b0fa-1e97396b0a77_3437x2426.png" width="1456" height="1028" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3884b3a-ae6f-46a2-b0fa-1e97396b0a77_3437x2426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1028,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1142542,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q1Sz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3884b3a-ae6f-46a2-b0fa-1e97396b0a77_3437x2426.png 424w, https://substackcdn.com/image/fetch/$s_!q1Sz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3884b3a-ae6f-46a2-b0fa-1e97396b0a77_3437x2426.png 848w, 
https://substackcdn.com/image/fetch/$s_!q1Sz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3884b3a-ae6f-46a2-b0fa-1e97396b0a77_3437x2426.png 1272w, https://substackcdn.com/image/fetch/$s_!q1Sz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3884b3a-ae6f-46a2-b0fa-1e97396b0a77_3437x2426.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I remember back in 2018 when we had to pick a workflow engine for a new large-scale data platform. 
Our options were pretty much just <strong>Luigi</strong>, <strong>Azkaban</strong>, and <strong>Airflow</strong>. </p><p>The choice was really simple back then: Airflow was the clear winner since it was Python-based and had great features. Nowadays it's much harder to navigate this landscape and properly compare architectures and features across all the available tools in the market.</p><h2>Netflix's New Contribution</h2><p>An exciting development in this ecosystem came when Netflix open sourced their next-generation orchestrator, <strong>Maestro</strong>, in July 2024. </p><p>Introduced via their <strong><a href="https://netflixtechblog.com/maestro-netflixs-workflow-orchestrator-ee13a06f9c78">tech blog</a></strong>, Maestro is designed as a highly scalable and flexible scheduler capable of handling large-scale heterogeneous workflows like ML training and data pipelines.</p><p>What makes Maestro stand out is its flexible execution support for Docker images and notebooks, along with its ability to handle both cyclic and acyclic (DAG) workflow patterns. </p><p>Since its July release, Maestro has gained notable traction in the community. However, the repository has seen limited code activity since then. 
</p><h2>Back-end Language</h2><p>In terms of back-end languages, these tools have a fairly even distribution between <strong>Java</strong>, <strong>Go</strong>, and <strong>Python</strong>, with the exception of <strong>Windmill</strong>, which is built using the rising <strong>Rust</strong> language.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7LqY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7d97762-6f24-4da7-b6b2-8e6a07b34722_1333x1405.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7LqY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7d97762-6f24-4da7-b6b2-8e6a07b34722_1333x1405.png 424w, https://substackcdn.com/image/fetch/$s_!7LqY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7d97762-6f24-4da7-b6b2-8e6a07b34722_1333x1405.png 848w, https://substackcdn.com/image/fetch/$s_!7LqY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7d97762-6f24-4da7-b6b2-8e6a07b34722_1333x1405.png 1272w, https://substackcdn.com/image/fetch/$s_!7LqY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7d97762-6f24-4da7-b6b2-8e6a07b34722_1333x1405.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7LqY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7d97762-6f24-4da7-b6b2-8e6a07b34722_1333x1405.png" width="1333" height="1405" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7d97762-6f24-4da7-b6b2-8e6a07b34722_1333x1405.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1405,&quot;width&quot;:1333,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:493354,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7LqY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7d97762-6f24-4da7-b6b2-8e6a07b34722_1333x1405.png 424w, https://substackcdn.com/image/fetch/$s_!7LqY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7d97762-6f24-4da7-b6b2-8e6a07b34722_1333x1405.png 848w, https://substackcdn.com/image/fetch/$s_!7LqY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7d97762-6f24-4da7-b6b2-8e6a07b34722_1333x1405.png 1272w, https://substackcdn.com/image/fetch/$s_!7LqY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7d97762-6f24-4da7-b6b2-8e6a07b34722_1333x1405.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Open Source vs Open Core Engines</h2><p>It's important to note that not all of these projects are truly open source. Some follow an "<strong><a href="https://opensource.com/article/21/11/open-core-vs-open-source">open core</a></strong>" model instead, where the main SaaS provider releases only certain core components as open source while keeping premium features, such as monitoring and security, proprietary. </p><p>When evaluating these tools for adoption, it's crucial to assess how portable and genuinely open source each project really is, as this can impact long-term sustainability and cost.</p><p>Many current open core tools like <strong>Kestra</strong> and <strong>Dagster</strong> keep essential enterprise features &#8211; especially security features like SSO &#8211; locked behind their enterprise versions. 
This is a deliberate strategy to monetise enterprise clients who need these capabilities.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f5Cq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F000ed656-9798-4626-818f-a0b03ab89907_2525x2696.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f5Cq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F000ed656-9798-4626-818f-a0b03ab89907_2525x2696.png 424w, https://substackcdn.com/image/fetch/$s_!f5Cq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F000ed656-9798-4626-818f-a0b03ab89907_2525x2696.png 848w, https://substackcdn.com/image/fetch/$s_!f5Cq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F000ed656-9798-4626-818f-a0b03ab89907_2525x2696.png 1272w, https://substackcdn.com/image/fetch/$s_!f5Cq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F000ed656-9798-4626-818f-a0b03ab89907_2525x2696.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f5Cq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F000ed656-9798-4626-818f-a0b03ab89907_2525x2696.png" width="1456" height="1555" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/000ed656-9798-4626-818f-a0b03ab89907_2525x2696.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1555,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2517476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f5Cq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F000ed656-9798-4626-818f-a0b03ab89907_2525x2696.png 424w, https://substackcdn.com/image/fetch/$s_!f5Cq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F000ed656-9798-4626-818f-a0b03ab89907_2525x2696.png 848w, https://substackcdn.com/image/fetch/$s_!f5Cq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F000ed656-9798-4626-818f-a0b03ab89907_2525x2696.png 1272w, https://substackcdn.com/image/fetch/$s_!f5Cq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F000ed656-9798-4626-818f-a0b03ab89907_2525x2696.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This approach creates a significant problem for OSS adoption: businesses that care about security and governance can't realistically use the open source versions of these products. 
</p><p>Open core users frequently complain about this limitation, particularly the lack of basic authentication and authorisation mechanisms in the open source editions of these products.</p><p>Currently, only projects like <strong>Apache Airflow</strong>, <strong>Flyte</strong> and <strong>Apache DolphinScheduler</strong> are guaranteed to remain fully open source, as they're not owned by any single commercial entity but rather governed by an open source foundation.</p><h2>Task-Centric vs Data-Centric Engines</h2><p>Workflow orchestration engines can be broadly classified by their fundamental approach to workflow management: task-centric versus data-centric, alongside other distinctions such as declarative vs code-based and batch vs event-driven.</p><h3>Task-Centric Orchestrators</h3><p><strong>Airflow</strong>, <strong>Luigi</strong>, <strong>Cadence</strong>, and <strong>Kestra</strong> exemplify the task-centric approach, organising workflows as <strong>Directed Acyclic Graphs (DAGs)</strong> of interconnected tasks. </p><p>In these engines, the <em><strong>task</strong></em> is the primary unit of work, capable of executing any type of operation. The scheduler's main concern is managing <em><strong>control flow</strong></em> and dependencies between tasks within the DAG, remaining largely agnostic to the actual work being performed.</p><h3>Data-Centric Orchestrators</h3><p>Engines like <strong>Dagster</strong>, <strong>Temporal</strong>, and <strong>Flyte</strong> take a more opinionated, data-centric approach. In these engines, data-oriented objects (or "<em><strong>assets</strong></em>" in Dagster's terminology) serve as the primary focus of the workflow. 
</p><p>They treat workflows as data-aware pipelines where assets &#8211; whether tables, files, ML models, or dbt models &#8211; are produced, consumed, and transformed.</p><p>Data-centric engines provide native support for passing data between tasks and offer tighter integration with modern data transformation frameworks like <strong>dbt</strong> and <strong>SQLMesh</strong> than task-centric engines do.</p><div><hr></div><h1>GitHub Repository Trends</h1><p>Open source projects are typically evaluated through key metrics including GitHub stars, download counts, contributor activity, and repository engagement (measured by commits, releases, and issue resolution rates).</p><p>As part of my commitment to understanding the open source ecosystem, I've developed my own <a href="https://practicaldataengineering.substack.com/p/building-data-pipeline-using-duckdb">small analytical platform</a> that tracks and analyses all GitHub events for public repositories. The following metrics and trends for 2024 are derived from this platform.</p><h2>Project Popularity</h2><p>Looking at GitHub repository star trends in 2024, <strong>Kestra</strong> has emerged as a rising workflow orchestration project. The graph below shows a spike in September, when Kestra surpassed all other projects in new stars gained in 2024. </p><p>This surge is directly linked to Kestra's $8M funding announcement, which was featured in <a href="https://techcrunch.com/2024/09/23/kestra-raises-another-8-million-for-its-open-source-orchestration-platform/">TechCrunch</a>. 
It's a clear example of how repository stars can spike in response to major company announcements.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PvwL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf1f85f-820e-4490-aef3-44c6c179481e_2400x1125.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PvwL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf1f85f-820e-4490-aef3-44c6c179481e_2400x1125.png 424w, https://substackcdn.com/image/fetch/$s_!PvwL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf1f85f-820e-4490-aef3-44c6c179481e_2400x1125.png 848w, https://substackcdn.com/image/fetch/$s_!PvwL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf1f85f-820e-4490-aef3-44c6c179481e_2400x1125.png 1272w, https://substackcdn.com/image/fetch/$s_!PvwL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf1f85f-820e-4490-aef3-44c6c179481e_2400x1125.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PvwL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf1f85f-820e-4490-aef3-44c6c179481e_2400x1125.png" width="1456" height="683" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/daf1f85f-820e-4490-aef3-44c6c179481e_2400x1125.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:683,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:437672,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PvwL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf1f85f-820e-4490-aef3-44c6c179481e_2400x1125.png 424w, https://substackcdn.com/image/fetch/$s_!PvwL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf1f85f-820e-4490-aef3-44c6c179481e_2400x1125.png 848w, https://substackcdn.com/image/fetch/$s_!PvwL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf1f85f-820e-4490-aef3-44c6c179481e_2400x1125.png 1272w, https://substackcdn.com/image/fetch/$s_!PvwL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf1f85f-820e-4490-aef3-44c6c179481e_2400x1125.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The well-established <strong>Apache Airflow</strong> and <strong>Prefect</strong> ranked second and third, respectively, among the most-starred workflow projects in 2024.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HQIQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf516462-124a-4b00-9f8d-b1be28c871cb_2079x1294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HQIQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf516462-124a-4b00-9f8d-b1be28c871cb_2079x1294.png 424w, 
https://substackcdn.com/image/fetch/$s_!HQIQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf516462-124a-4b00-9f8d-b1be28c871cb_2079x1294.png 848w, https://substackcdn.com/image/fetch/$s_!HQIQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf516462-124a-4b00-9f8d-b1be28c871cb_2079x1294.png 1272w, https://substackcdn.com/image/fetch/$s_!HQIQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf516462-124a-4b00-9f8d-b1be28c871cb_2079x1294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HQIQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf516462-124a-4b00-9f8d-b1be28c871cb_2079x1294.png" width="1456" height="906" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df516462-124a-4b00-9f8d-b1be28c871cb_2079x1294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:906,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:233007,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HQIQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf516462-124a-4b00-9f8d-b1be28c871cb_2079x1294.png 424w, 
https://substackcdn.com/image/fetch/$s_!HQIQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf516462-124a-4b00-9f8d-b1be28c871cb_2079x1294.png 848w, https://substackcdn.com/image/fetch/$s_!HQIQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf516462-124a-4b00-9f8d-b1be28c871cb_2079x1294.png 1272w, https://substackcdn.com/image/fetch/$s_!HQIQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf516462-124a-4b00-9f8d-b1be28c871cb_2079x1294.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Code Activity</h2><p>Code activity in open source projects can be measured by two main metrics: pull requests (opened, closed, and reviewed) and commit volume (push events).</p><p>For 2024, <strong>Dagster</strong> and <strong>Airflow</strong> led the pack in pull request activity, each processing over 10K PRs from their contributors, with Prefect following close behind. </p><p>On the other end of the spectrum, projects like <strong>Cadence</strong>, <strong>Luigi</strong>, <strong>Maestro</strong>, and <strong>Azkaban</strong> showed concerning levels of inactivity, raising questions about their long-term health.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!58Sb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b915c76-83dd-4260-87bc-4faa0a4b5fa1_2143x2737.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!58Sb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b915c76-83dd-4260-87bc-4faa0a4b5fa1_2143x2737.png 424w, https://substackcdn.com/image/fetch/$s_!58Sb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b915c76-83dd-4260-87bc-4faa0a4b5fa1_2143x2737.png 848w, https://substackcdn.com/image/fetch/$s_!58Sb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b915c76-83dd-4260-87bc-4faa0a4b5fa1_2143x2737.png 1272w, 
https://substackcdn.com/image/fetch/$s_!58Sb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b915c76-83dd-4260-87bc-4faa0a4b5fa1_2143x2737.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!58Sb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b915c76-83dd-4260-87bc-4faa0a4b5fa1_2143x2737.png" width="1456" height="1860" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b915c76-83dd-4260-87bc-4faa0a4b5fa1_2143x2737.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1860,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:493893,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!58Sb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b915c76-83dd-4260-87bc-4faa0a4b5fa1_2143x2737.png 424w, https://substackcdn.com/image/fetch/$s_!58Sb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b915c76-83dd-4260-87bc-4faa0a4b5fa1_2143x2737.png 848w, https://substackcdn.com/image/fetch/$s_!58Sb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b915c76-83dd-4260-87bc-4faa0a4b5fa1_2143x2737.png 1272w, 
https://substackcdn.com/image/fetch/$s_!58Sb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b915c76-83dd-4260-87bc-4faa0a4b5fa1_2143x2737.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Looking at commit volume, <strong>Dagster</strong> demonstrated remarkable development activity with an impressive 27K commits in 2024. 
<strong>Prefect</strong> and <strong>Windmill</strong> also showed strong development momentum, each recording over 10K commits.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oXS1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faced497b-a685-49c4-b5ef-845ae7e3ff66_2235x1277.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oXS1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faced497b-a685-49c4-b5ef-845ae7e3ff66_2235x1277.png 424w, https://substackcdn.com/image/fetch/$s_!oXS1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faced497b-a685-49c4-b5ef-845ae7e3ff66_2235x1277.png 848w, https://substackcdn.com/image/fetch/$s_!oXS1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faced497b-a685-49c4-b5ef-845ae7e3ff66_2235x1277.png 1272w, https://substackcdn.com/image/fetch/$s_!oXS1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faced497b-a685-49c4-b5ef-845ae7e3ff66_2235x1277.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oXS1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faced497b-a685-49c4-b5ef-845ae7e3ff66_2235x1277.png" width="1456" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aced497b-a685-49c4-b5ef-845ae7e3ff66_2235x1277.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:219361,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oXS1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faced497b-a685-49c4-b5ef-845ae7e3ff66_2235x1277.png 424w, https://substackcdn.com/image/fetch/$s_!oXS1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faced497b-a685-49c4-b5ef-845ae7e3ff66_2235x1277.png 848w, https://substackcdn.com/image/fetch/$s_!oXS1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faced497b-a685-49c4-b5ef-845ae7e3ff66_2235x1277.png 1272w, https://substackcdn.com/image/fetch/$s_!oXS1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faced497b-a685-49c4-b5ef-845ae7e3ff66_2235x1277.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h3>Project Collaboration</h3><p>The health and sustainability of an open source project largely depend on its contributor base &#8211; the wider and more diverse, the better.</p><p>When evaluating contributor metrics, it's crucial to distinguish between <em><strong>active contributors</strong></em> who consistently work throughout the year and one-off contributors who make occasional submissions. Active contributors provide a more meaningful measure of project health.</p><p>Looking at active contributors in 2024, <strong>Airflow</strong> and <strong>Dagster</strong> lead the ecosystem with over 20 active contributors each. Any major open source project with few active contributors (e.g. fewer than 5) raises sustainability concerns. 
By this metric, projects like Argo Workflows, Mage-ai, DolphinScheduler, and Flyte fall into a warning zone.</p><p>At the concerning end of the spectrum, projects like <strong>Luigi</strong> and <strong>Azkaban</strong> showed no active contributions throughout the year.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WCRE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc87044-d810-496d-af0b-f8d042a8de6d_1566x1905.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WCRE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc87044-d810-496d-af0b-f8d042a8de6d_1566x1905.png 424w, https://substackcdn.com/image/fetch/$s_!WCRE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc87044-d810-496d-af0b-f8d042a8de6d_1566x1905.png 848w, https://substackcdn.com/image/fetch/$s_!WCRE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc87044-d810-496d-af0b-f8d042a8de6d_1566x1905.png 1272w, https://substackcdn.com/image/fetch/$s_!WCRE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc87044-d810-496d-af0b-f8d042a8de6d_1566x1905.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WCRE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc87044-d810-496d-af0b-f8d042a8de6d_1566x1905.png" width="1456" height="1771" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2dc87044-d810-496d-af0b-f8d042a8de6d_1566x1905.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1771,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:355582,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WCRE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc87044-d810-496d-af0b-f8d042a8de6d_1566x1905.png 424w, https://substackcdn.com/image/fetch/$s_!WCRE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc87044-d810-496d-af0b-f8d042a8de6d_1566x1905.png 848w, https://substackcdn.com/image/fetch/$s_!WCRE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc87044-d810-496d-af0b-f8d042a8de6d_1566x1905.png 1272w, https://substackcdn.com/image/fetch/$s_!WCRE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc87044-d810-496d-af0b-f8d042a8de6d_1566x1905.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h2>Community Engagement</h2><p>Community engagement can be measured through several indicators: <em><strong>issues</strong></em> logged, <em><strong>comment</strong></em> volume, and participation in official community channels like <strong>Slack</strong> and discussion boards. These metrics help determine how vibrant and active a tool's community really is.</p><p>Another key metric is the ratio of closed to opened issues, which indicates how well project maintainers keep up with community-reported problems.</p><p>Looking at GitHub activity in terms of total issues opened and closed, <strong>Airflow</strong>, <strong>Kestra</strong>, <strong>Prefect</strong>, and <strong>DolphinScheduler</strong> show the strongest community engagement.</p><p>Based on total issues registered, we can consider fewer than 100 issues or an issue resolution rate below 50% a concern, and fewer than 50 issues a danger zone. 
Again, projects like <strong>Luigi</strong>, <strong>Azkaban</strong>, and <strong>Cadence</strong> fall into this danger zone, suggesting minimal community interaction.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dBF5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569d5336-03ce-43f9-8516-4414774e3a53_2337x1390.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dBF5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569d5336-03ce-43f9-8516-4414774e3a53_2337x1390.png 424w, https://substackcdn.com/image/fetch/$s_!dBF5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569d5336-03ce-43f9-8516-4414774e3a53_2337x1390.png 848w, https://substackcdn.com/image/fetch/$s_!dBF5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569d5336-03ce-43f9-8516-4414774e3a53_2337x1390.png 1272w, https://substackcdn.com/image/fetch/$s_!dBF5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569d5336-03ce-43f9-8516-4414774e3a53_2337x1390.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dBF5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569d5336-03ce-43f9-8516-4414774e3a53_2337x1390.png" width="1456" height="866" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/569d5336-03ce-43f9-8516-4414774e3a53_2337x1390.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:866,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:309529,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dBF5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569d5336-03ce-43f9-8516-4414774e3a53_2337x1390.png 424w, https://substackcdn.com/image/fetch/$s_!dBF5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569d5336-03ce-43f9-8516-4414774e3a53_2337x1390.png 848w, https://substackcdn.com/image/fetch/$s_!dBF5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569d5336-03ce-43f9-8516-4414774e3a53_2337x1390.png 1272w, https://substackcdn.com/image/fetch/$s_!dBF5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569d5336-03ce-43f9-8516-4414774e3a53_2337x1390.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h2>Downloads &amp; Installations</h2><p>Most open source orchestration tools are either Python-based or provide Python clients and SDKs, making PyPI download statistics a useful metric for measuring adoption and popularity.</p><p>Looking at the download stats from <em><a href="http://clickpy.clickhouse.com">clickpy.clickhouse.com</a></em>, <strong>Apache Airflow</strong> dominates the ecosystem with a staggering <strong>320M downloads</strong> in 2024 alone, ten times more than its nearest competitor. This reinforces Airflow's position as the leading tool in the entire data engineering ecosystem.</p><p><strong>Prefect</strong> and <strong>Dagster</strong> round out the top three most downloaded packages in 2024, with 32M and 15M downloads respectively.</p><p>An interesting observation: despite being an inactive project, <strong>Luigi</strong> recorded 5.6M downloads in 2024.
This likely reflects existing users updating to minor releases, suggesting a significant legacy user base still relies on the platform.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tH1y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27208d7f-5905-4ba7-93e8-e250c080280f_1316x913.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tH1y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27208d7f-5905-4ba7-93e8-e250c080280f_1316x913.png 424w, https://substackcdn.com/image/fetch/$s_!tH1y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27208d7f-5905-4ba7-93e8-e250c080280f_1316x913.png 848w, https://substackcdn.com/image/fetch/$s_!tH1y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27208d7f-5905-4ba7-93e8-e250c080280f_1316x913.png 1272w, https://substackcdn.com/image/fetch/$s_!tH1y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27208d7f-5905-4ba7-93e8-e250c080280f_1316x913.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tH1y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27208d7f-5905-4ba7-93e8-e250c080280f_1316x913.png" width="1316" height="913" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27208d7f-5905-4ba7-93e8-e250c080280f_1316x913.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:913,&quot;width&quot;:1316,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125160,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tH1y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27208d7f-5905-4ba7-93e8-e250c080280f_1316x913.png 424w, https://substackcdn.com/image/fetch/$s_!tH1y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27208d7f-5905-4ba7-93e8-e250c080280f_1316x913.png 848w, https://substackcdn.com/image/fetch/$s_!tH1y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27208d7f-5905-4ba7-93e8-e250c080280f_1316x913.png 1272w, https://substackcdn.com/image/fetch/$s_!tH1y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27208d7f-5905-4ba7-93e8-e250c080280f_1316x913.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h2>Summary &amp; Analysis</h2><p>Here is the summary of the evaluation of the workflow orchestration engines across key GitHub metrics:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dXPl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef68785-9014-4a25-b9a6-7cd734ce083f_2256x2943.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dXPl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef68785-9014-4a25-b9a6-7cd734ce083f_2256x2943.png 424w, 
https://substackcdn.com/image/fetch/$s_!dXPl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef68785-9014-4a25-b9a6-7cd734ce083f_2256x2943.png 848w, https://substackcdn.com/image/fetch/$s_!dXPl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef68785-9014-4a25-b9a6-7cd734ce083f_2256x2943.png 1272w, https://substackcdn.com/image/fetch/$s_!dXPl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef68785-9014-4a25-b9a6-7cd734ce083f_2256x2943.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dXPl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef68785-9014-4a25-b9a6-7cd734ce083f_2256x2943.png" width="1456" height="1899" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eef68785-9014-4a25-b9a6-7cd734ce083f_2256x2943.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1899,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3981218,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dXPl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef68785-9014-4a25-b9a6-7cd734ce083f_2256x2943.png 424w, 
https://substackcdn.com/image/fetch/$s_!dXPl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef68785-9014-4a25-b9a6-7cd734ce083f_2256x2943.png 848w, https://substackcdn.com/image/fetch/$s_!dXPl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef68785-9014-4a25-b9a6-7cd734ce083f_2256x2943.png 1272w, https://substackcdn.com/image/fetch/$s_!dXPl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef68785-9014-4a25-b9a6-7cd734ce083f_2256x2943.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Advancing Projects</h3><p>After a decade, <strong>Airflow</strong> remains the dominant force in open source orchestration, maintaining the most active and vibrant open source project in the market.</p><p><strong>Dagster</strong> is likely the second most popular orchestrator in 2024. Along with <strong>Prefect</strong> and <strong>Temporal</strong>, it's capturing significant market attention, particularly among startups and smaller-scale deployments.</p><p>These tools stand out for their simplified approach to data-centric workflow management, more intuitive UIs, and enhanced support for event-driven workflows.</p><h3>Rising Projects</h3><p><strong>Kestra</strong> became one of the fastest-growing orchestration tools in 2024, gaining momentum after securing $8M in funding. The project has also been praised for its simplicity, declarative YAML-based workflow definitions, and support for event-driven workflows.</p><h3>Declining Projects</h3><p>Legacy tools <strong>Luigi</strong> and <strong>Azkaban</strong> rank at the bottom across all metrics. While neither project has been officially archived or retired, their lack of meaningful development activity in 2024 effectively marks them as inactive.</p><p>Luigi saw only minor bug fixes throughout the year, while Azkaban showed no code activity whatsoever. This dramatic decline in maintenance suggests these once-popular orchestrators have reached the end of their active lifecycle.</p><p>The future of <strong>Netflix's Maestro</strong> remains uncertain. 2025 will be a pivotal year, revealing whether the project gains momentum on GitHub or follows the path of some other abandoned in-house tools released by tech giants.</p><h2>2024 OSS Orchestration Competition</h2><p>Let's turn this into a competition and rank our open source workflow engines!
</p><p>We'll identify the top three performers across key metrics in 2024, creating a sort of "<em><strong>workflow orchestrator competition</strong></em>."</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P7Lz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ed904a-3982-4045-930c-31c8cf9b66b0_2317x2208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P7Lz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ed904a-3982-4045-930c-31c8cf9b66b0_2317x2208.png 424w, https://substackcdn.com/image/fetch/$s_!P7Lz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ed904a-3982-4045-930c-31c8cf9b66b0_2317x2208.png 848w, https://substackcdn.com/image/fetch/$s_!P7Lz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ed904a-3982-4045-930c-31c8cf9b66b0_2317x2208.png 1272w, https://substackcdn.com/image/fetch/$s_!P7Lz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ed904a-3982-4045-930c-31c8cf9b66b0_2317x2208.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P7Lz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ed904a-3982-4045-930c-31c8cf9b66b0_2317x2208.png" width="1456" height="1388" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06ed904a-3982-4045-930c-31c8cf9b66b0_2317x2208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1388,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:760979,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P7Lz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ed904a-3982-4045-930c-31c8cf9b66b0_2317x2208.png 424w, https://substackcdn.com/image/fetch/$s_!P7Lz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ed904a-3982-4045-930c-31c8cf9b66b0_2317x2208.png 848w, https://substackcdn.com/image/fetch/$s_!P7Lz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ed904a-3982-4045-930c-31c8cf9b66b0_2317x2208.png 1272w, https://substackcdn.com/image/fetch/$s_!P7Lz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ed904a-3982-4045-930c-31c8cf9b66b0_2317x2208.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Based on our OSS metrics and medal counts, <strong>Apache Airflow</strong> claims the crown as 2024's champion workflow orchestrator, with <strong>Dagster</strong> taking second and <strong>Prefect</strong> earning third place.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TcEs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db5c851-c9dc-4575-8b0f-bd933b85d7b0_1631x956.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TcEs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db5c851-c9dc-4575-8b0f-bd933b85d7b0_1631x956.png 424w, 
https://substackcdn.com/image/fetch/$s_!TcEs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db5c851-c9dc-4575-8b0f-bd933b85d7b0_1631x956.png 848w, https://substackcdn.com/image/fetch/$s_!TcEs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db5c851-c9dc-4575-8b0f-bd933b85d7b0_1631x956.png 1272w, https://substackcdn.com/image/fetch/$s_!TcEs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db5c851-c9dc-4575-8b0f-bd933b85d7b0_1631x956.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TcEs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db5c851-c9dc-4575-8b0f-bd933b85d7b0_1631x956.png" width="1456" height="853" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7db5c851-c9dc-4575-8b0f-bd933b85d7b0_1631x956.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:896163,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TcEs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db5c851-c9dc-4575-8b0f-bd933b85d7b0_1631x956.png 424w, 
https://substackcdn.com/image/fetch/$s_!TcEs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db5c851-c9dc-4575-8b0f-bd933b85d7b0_1631x956.png 848w, https://substackcdn.com/image/fetch/$s_!TcEs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db5c851-c9dc-4575-8b0f-bd933b85d7b0_1631x956.png 1272w, https://substackcdn.com/image/fetch/$s_!TcEs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db5c851-c9dc-4575-8b0f-bd933b85d7b0_1631x956.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Important note:</strong> <em>This comparison focuses solely on open source activity metrics and community engagement. It should not be interpreted as a judgment of each tool's features, capabilities, or overall ecosystem. The best workflow orchestrator for your needs will depend on your specific requirements, use cases, and technical environment.</em></p><div><hr></div><p></p><h1>Major 2024 Trends</h1><p>Let&#8217;s explore the key development trends in the workflow orchestration ecosystem for 2024.</p><h2>Event-Driven &amp; Real-Time Orchestration</h2><p>The workflow orchestration ecosystem is shifting toward event-triggered and real-time processing capabilities, reflecting the industry's growing demand for real-time workload management.</p><p>In 2024, several major products made significant moves in this direction. <strong>Kestra</strong> introduced Real-time and HTTP Triggers, enabling millisecond-latency responses to events from systems like <strong>Kafka</strong> and AWS SQS, as well as to incoming HTTP requests.
</p><p><strong>Temporal</strong> enhanced its real-time capabilities with <em>Workflow Update</em> and <em>Workflow Update-With-Start</em> features, enabling synchronous processing for interactive applications. Meanwhile, <strong>DolphinScheduler</strong> expanded its event-driven architecture with a variety of new triggers.</p><p><strong>Mage</strong> focused on real-time data processing, introducing <em>Streaming Pipelines</em> that support real-time ingestion and transformation from sources like Kafka and Google Pub/Sub.</p><p>Even <strong>Apache Airflow</strong>, traditionally a batch-oriented system, has recognised this shift toward real-time processing. Its 2024 updates added new conditions for data-aware scheduling and a new scheduling mechanism that triggers DAGs based on both dataset events and time.</p><h2>AI/LLM Integration &amp; Automation</h2><p>The integration of AI and LLM capabilities emerged as another major trend in workflow orchestration during 2024, reflecting the growing role of LLM-based workloads in enterprise data operations.</p><p><strong>Prefect</strong> made a significant move in this space by launching <em>ControlFlow</em>, a framework specifically designed for AI-driven workflows and LLM integration. <strong>Prefect</strong> also integrated <em><strong>Marvin</strong></em>, an LLM-powered assistant, to simplify the creation of AI workflows.</p><p><strong>Temporal</strong> embraced <em>multi-agent workflows</em>, enabling sophisticated coordination between AI models, software applications, and human participants.
</p><p>Meanwhile, <strong>Windmill</strong> took a different approach by integrating AI directly into the development experience, introducing an AI copilot to assist in flow building.</p><h2>Enhanced Resource Management &amp; Execution</h2><p>Intelligent resource management has become a critical focus for workflow engines, particularly as organisations increasingly run workflows on cloud-managed and serverless platforms. Several cloud-native engines made significant advances in this area during 2024.</p><p><strong>Temporal</strong> introduced sophisticated resource management with its worker <em><strong>auto-tuning</strong></em> feature, which automatically adjusts worker slots based on real-time CPU and memory usage.</p><p><strong>Kestra</strong> introduced <em>task runners</em> that can dynamically offload resource-intensive tasks to on-demand compute services like Azure Batch, Google Batch, and Google Cloud Run.</p><p><strong>Dagster Pipes</strong> became stable in version 1.8, released in 2024, with enhanced integrations for Lambda, Kubernetes, and Databricks planned going forward.</p><p><strong>DolphinScheduler</strong> plans to integrate <strong>KEDA</strong> (Kubernetes Event-Driven Autoscaling), which will enable automatic worker scaling based on workload demands, further enhancing its Kubernetes-native capabilities.</p><p><strong>Prefect</strong> and <strong>Flyte</strong> expanded their back-end execution capabilities in 2024 by strengthening their integrations with scalable, distributed Python execution frameworks such as <strong>Ray</strong> and <strong>Dask</strong>, enabling more efficient parallel processing and distributed task execution.</p><div><hr></div><h1>Conclusion &amp; Recommendations</h1><p>After a decade, <strong>Apache Airflow</strong> remains the most mature and widely adopted orchestration tool in the data engineering ecosystem.
Its position as the market leader is reinforced by major cloud vendors: Google <strong>Cloud Composer</strong> and Amazon <strong>MWAA</strong> have both standardised on Airflow for their managed workflow services.</p><p>While Airflow faces criticism for its steep learning curve, operational overhead, and a not-so-friendly UX with an outdated UI (though a complete revamp is planned for the upcoming version 3.0), its primary technical limitation is its focus on batch-oriented workflows, with less native support for modern dynamic workflow patterns.</p><p>For <strong>large-scale deployments</strong> managing a large number of heterogeneous workflows that require a general-purpose engine with extensive operations support and a large ecosystem, Apache Airflow remains the top choice. At the Airflow Summit 2024, major companies showcased Airflow's massive scalability, with <strong><a href="https://www.astronomer.io/blog/airflow-in-action-uber/">Uber</a></strong> orchestrating 450K pipeline runs daily across 1000 teams, <strong>Stripe</strong> managing 150K tasks, and <strong>LinkedIn</strong> operating over 10K parallel DAGs.</p><p><strong>Startups</strong> and <strong>small to mid-sized businesses</strong> should consider newer orchestration tools that offer a streamlined setup and development experience through features like in-browser development environments, declarative workflow authoring, and low-code capabilities.</p><p>For <strong>dynamic</strong> and <strong>data-centric</strong> workflow orchestration, products like <strong>Prefect</strong> and <strong>Dagster</strong> offer stronger data-aware orchestration than traditional task-based schedulers.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DS7n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" width="220" height="218.42293906810036" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:558,&quot;resizeWidth&quot;:220,&quot;bytes&quot;:136646,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/in/alirezasadeghi/&quot;,&quot;text&quot;:&quot;Follow Me on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/in/alirezasadeghi/"><span>Follow Me on LinkedIn</span></a></p>]]></content:encoded></item><item><title><![CDATA[State of Open Source Real-Time OLAP Systems 2025]]></title><description><![CDATA[Overview of Major 2024 Trends and Emerging Technologies Shaping 2025]]></description><link>https://www.pracdata.io/p/state-of-open-source-read-time-olap-2025</link><guid isPermaLink="false">https://www.pracdata.io/p/state-of-open-source-read-time-olap-2025</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 26 Jan 2025 11:43:45 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a42a7-1d48-4f96-a3dc-34d3db2c5034_2302x2350.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Un9o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c33400-d2e2-4261-bb17-c8ab1851c568_1336x944.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Un9o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c33400-d2e2-4261-bb17-c8ab1851c568_1336x944.png 424w, https://substackcdn.com/image/fetch/$s_!Un9o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c33400-d2e2-4261-bb17-c8ab1851c568_1336x944.png 848w, https://substackcdn.com/image/fetch/$s_!Un9o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c33400-d2e2-4261-bb17-c8ab1851c568_1336x944.png 1272w, https://substackcdn.com/image/fetch/$s_!Un9o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c33400-d2e2-4261-bb17-c8ab1851c568_1336x944.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Un9o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c33400-d2e2-4261-bb17-c8ab1851c568_1336x944.png" width="1336" height="944" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23c33400-d2e2-4261-bb17-c8ab1851c568_1336x944.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:944,&quot;width&quot;:1336,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:502926,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Un9o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c33400-d2e2-4261-bb17-c8ab1851c568_1336x944.png 424w, https://substackcdn.com/image/fetch/$s_!Un9o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c33400-d2e2-4261-bb17-c8ab1851c568_1336x944.png 848w, https://substackcdn.com/image/fetch/$s_!Un9o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c33400-d2e2-4261-bb17-c8ab1851c568_1336x944.png 1272w, https://substackcdn.com/image/fetch/$s_!Un9o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c33400-d2e2-4261-bb17-c8ab1851c568_1336x944.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>This is the fourth part in the <strong>Data Landscape Trends 2024-2025</strong> series, focusing on the state of open-source real-time OLAP database systems.</p><p>In the <strong><a href="https://practicaldataengineering.substack.com/p/the-evolution-of-business-intelligence-stack">first part</a></strong>, we explored the <strong>evolution of the BI stack</strong>; the <a href="https://practicaldataengineering.substack.com/p/the-rise-of-single-node-processing">second part</a> examined the <strong>rise of single-node processing engines</strong>; and the <strong><a href="https://practicaldataengineering.substack.com/p/zero-disk-architecture-the-future">third part</a></strong> discussed the <strong>evolution of zero-disk architecture</strong>.</p><h1>Introduction</h1><p>Real-time OLAP database systems have evolved significantly since their origins in the 2010s at companies like Yandex, Metamarkets and LinkedIn, which introduced systems such as 
<strong>ClickHouse</strong>, <strong>Druid</strong>, and <strong>Pinot</strong> to address the limitations of traditional MPP engines.</p><p>Initially designed for <strong>sub-second analytics</strong> on massive volumes of append-only web logs and clickstream data, these specialised databases have expanded their capabilities over the years. </p><p>New entrants like <strong>Apache Doris</strong> and <strong>StarRocks</strong> have joined the scene, aiming to bridge the gap between traditional OLAP architectures and modern MPP systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ADcc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a42a7-1d48-4f96-a3dc-34d3db2c5034_2302x2350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ADcc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a42a7-1d48-4f96-a3dc-34d3db2c5034_2302x2350.png 424w, https://substackcdn.com/image/fetch/$s_!ADcc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a42a7-1d48-4f96-a3dc-34d3db2c5034_2302x2350.png 848w, https://substackcdn.com/image/fetch/$s_!ADcc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a42a7-1d48-4f96-a3dc-34d3db2c5034_2302x2350.png 1272w, https://substackcdn.com/image/fetch/$s_!ADcc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a42a7-1d48-4f96-a3dc-34d3db2c5034_2302x2350.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ADcc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a42a7-1d48-4f96-a3dc-34d3db2c5034_2302x2350.png" width="1456" height="1486" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/889a42a7-1d48-4f96-a3dc-34d3db2c5034_2302x2350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1486,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3894675,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ADcc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a42a7-1d48-4f96-a3dc-34d3db2c5034_2302x2350.png 424w, https://substackcdn.com/image/fetch/$s_!ADcc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a42a7-1d48-4f96-a3dc-34d3db2c5034_2302x2350.png 848w, https://substackcdn.com/image/fetch/$s_!ADcc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a42a7-1d48-4f96-a3dc-34d3db2c5034_2302x2350.png 1272w, https://substackcdn.com/image/fetch/$s_!ADcc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a42a7-1d48-4f96-a3dc-34d3db2c5034_2302x2350.png 1456w" sizes="100vw"></picture></div></a></figure></div>
<p>This article provides a comprehensive overview of the evolving real-time OLAP ecosystem, examining the current state and advancements of major open-source OLAP engines. 
We'll explore:</p><ul><li><p>Background on real-time OLAP database systems</p></li><li><p>An analysis of the current landscape and leading open-source products</p></li><li><p>Emerging trends in 2024</p></li><li><p>Major features and capabilities introduced by each product in 2024</p></li><li><p>An assessment of key open-source metrics including popularity, development activity, and community engagement</p></li><li><p>A comprehensive comparison of architectural approaches and core features</p></li><li><p>Recommendations for choosing and implementing these systems</p></li></ul><p></p><h1>Real-time OLAP Database Systems</h1><p>Real-time OLAP engines are specialised databases designed to deliver <strong>sub-second analytics</strong> performance.</p><p>They operate similarly to <em><strong>cube servers</strong></em> in traditional BI solutions, pre-computing metrics across various dimensions to enable real-time drilling and slice-and-dice analysis of data.</p><p>These engines achieve exceptional query performance through a combination of optimised storage and sophisticated indexing during data ingestion. Their architecture treats data as <em><strong>immutable</strong></em> and is optimised primarily for <em><strong>append-only</strong></em> operations and replacements at the segment (data chunk) level.</p>
<h2>Trade-offs</h2><p>While this storage paradigm is particularly well-suited for log and event workloads in <em><strong>denormalised</strong></em> format, it comes with certain trade-offs: inflexible data models, limited join capabilities, and higher ingestion latency.</p><p>Unlike modern MPP-based scalable storage systems such as Redshift, BigQuery, and Snowflake&#8212;which couldn't deliver sub-second queries due to architectural limitations&#8212;these systems prioritised query performance over conventional database features like table joins, ACID guarantees, and row-level mutations.</p><p>In recent years, ClickHouse and newer systems like Apache Doris and StarRocks have been moving beyond the traditional real-time OLAP storage model by introducing support for mutable data operations and complex queries typically associated with data warehouse systems.</p><h1>Current Landscape and Major Products</h1><p>The following graph illustrates the development timeline of major open-source real-time OLAP engines, highlighting when each system was open-sourced and, where applicable, donated to the Apache Software Foundation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1GEX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a21f22-beaf-4637-bf0f-d20daa10602f_1880x1905.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1GEX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a21f22-beaf-4637-bf0f-d20daa10602f_1880x1905.png 424w, 
https://substackcdn.com/image/fetch/$s_!1GEX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a21f22-beaf-4637-bf0f-d20daa10602f_1880x1905.png 848w, https://substackcdn.com/image/fetch/$s_!1GEX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a21f22-beaf-4637-bf0f-d20daa10602f_1880x1905.png 1272w, https://substackcdn.com/image/fetch/$s_!1GEX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a21f22-beaf-4637-bf0f-d20daa10602f_1880x1905.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1GEX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a21f22-beaf-4637-bf0f-d20daa10602f_1880x1905.png" width="1456" height="1475" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99a21f22-beaf-4637-bf0f-d20daa10602f_1880x1905.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1475,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:600024,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1GEX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a21f22-beaf-4637-bf0f-d20daa10602f_1880x1905.png 424w, 
https://substackcdn.com/image/fetch/$s_!1GEX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a21f22-beaf-4637-bf0f-d20daa10602f_1880x1905.png 848w, https://substackcdn.com/image/fetch/$s_!1GEX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a21f22-beaf-4637-bf0f-d20daa10602f_1880x1905.png 1272w, https://substackcdn.com/image/fetch/$s_!1GEX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a21f22-beaf-4637-bf0f-d20daa10602f_1880x1905.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p></p><p>The real-time OLAP landscape features several established products, each with its own origin story. <strong>ClickHouse</strong> was initiated at <strong>Yandex</strong> in 2010, followed by <strong>Apache Druid</strong> (developed by <strong>Metamarkets</strong> in 2011), <strong>Apache Kylin</strong> (created by <strong>eBay</strong>), and <strong>Apache Pinot</strong> (originated at <strong>LinkedIn</strong> in 2013).</p><p>More recently, the ecosystem expanded with two significant additions: <strong>Apache Doris</strong> and <strong>StarRocks</strong>. <strong>Baidu</strong> developed and open-sourced Apache Doris in 2017-2018.</p><p>StarRocks later emerged as a fork of Doris, led by former Apache Doris PMC members who sought to address what they perceived as gaps in the original project's roadmap for building modern real-time analytics systems.</p><h3>Divergence From Traditional OLAP Model</h3><p>While Apache Doris and StarRocks are classified as real-time OLAP systems, they represent a divergence from traditional OLAP approaches. 
</p><p>Rather than relying on <strong>immutable storage models</strong> and heavy pre-processing and indexing methods with cube semantics, these systems bridge the gap between conventional OLAP designs and modern MPP (Massively Parallel Processing) architectures like Amazon Redshift and Google BigQuery.</p><p>Through native support for both bulk and row-level updates, alongside complex join capabilities, these systems aim to offer a hybrid solution that combines real-time OLAP performance with the versatility of general-purpose analytics platforms.</p><p>The ecosystem also includes proprietary solutions such as <strong>Rockset</strong> and <strong>Kinetica</strong>, though these fall outside the scope of this open-source-focused analysis.</p><h1>GitHub Repository Trends</h1><p>Open-source projects are often evaluated based on key metrics such as repository stars, download counts, contributor activity, and repository engagement, including commits, releases, and issues logged and resolved.</p><p>As part of my work and passion for the open-source ecosystem, I run my own <a href="https://practicaldataengineering.substack.com/p/building-data-pipeline-using-duckdb">little analytics platform</a> to collect, store, and analyse GitHub events to track year-over-year trends in data engineering-related projects.</p><h2>Project Popularity</h2><p>Taking GitHub's Watch (Star) and Fork events as indicators of community interest, my analysis of 2024 data shows that <strong>ClickHouse</strong> stands out as the clear leader. 
</p><p>With approximately 6,400 new stars and a significantly higher number of forks than other projects, ClickHouse outperformed its competitors, drawing more than twice the interest of second-ranked Apache Doris.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7BBq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8277c-8d35-4da1-9f57-9437cefb73c9_1899x940.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7BBq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8277c-8d35-4da1-9f57-9437cefb73c9_1899x940.png 424w, https://substackcdn.com/image/fetch/$s_!7BBq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8277c-8d35-4da1-9f57-9437cefb73c9_1899x940.png 848w, https://substackcdn.com/image/fetch/$s_!7BBq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8277c-8d35-4da1-9f57-9437cefb73c9_1899x940.png 1272w, https://substackcdn.com/image/fetch/$s_!7BBq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8277c-8d35-4da1-9f57-9437cefb73c9_1899x940.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7BBq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8277c-8d35-4da1-9f57-9437cefb73c9_1899x940.png" width="1456" height="721" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61b8277c-8d35-4da1-9f57-9437cefb73c9_1899x940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:721,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:201609,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7BBq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8277c-8d35-4da1-9f57-9437cefb73c9_1899x940.png 424w, https://substackcdn.com/image/fetch/$s_!7BBq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8277c-8d35-4da1-9f57-9437cefb73c9_1899x940.png 848w, https://substackcdn.com/image/fetch/$s_!7BBq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8277c-8d35-4da1-9f57-9437cefb73c9_1899x940.png 1272w, https://substackcdn.com/image/fetch/$s_!7BBq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8277c-8d35-4da1-9f57-9437cefb73c9_1899x940.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p></p><p>The number of repository forks in 2024 follows a similar trend:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mjKe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5dd699f-cf41-492d-a243-961b47b7d9cd_1897x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mjKe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5dd699f-cf41-492d-a243-961b47b7d9cd_1897x934.png 424w, https://substackcdn.com/image/fetch/$s_!mjKe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5dd699f-cf41-492d-a243-961b47b7d9cd_1897x934.png 848w, 
https://substackcdn.com/image/fetch/$s_!mjKe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5dd699f-cf41-492d-a243-961b47b7d9cd_1897x934.png 1272w, https://substackcdn.com/image/fetch/$s_!mjKe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5dd699f-cf41-492d-a243-961b47b7d9cd_1897x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mjKe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5dd699f-cf41-492d-a243-961b47b7d9cd_1897x934.png" width="1456" height="717" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5dd699f-cf41-492d-a243-961b47b7d9cd_1897x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:717,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:195363,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mjKe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5dd699f-cf41-492d-a243-961b47b7d9cd_1897x934.png 424w, https://substackcdn.com/image/fetch/$s_!mjKe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5dd699f-cf41-492d-a243-961b47b7d9cd_1897x934.png 848w, 
https://substackcdn.com/image/fetch/$s_!mjKe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5dd699f-cf41-492d-a243-961b47b7d9cd_1897x934.png 1272w, https://substackcdn.com/image/fetch/$s_!mjKe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5dd699f-cf41-492d-a243-961b47b7d9cd_1897x934.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Apache Doris and StarRocks are having a close race to capture the attention of the industry and gain adoption as leading 
<em><strong>hybrid real-time OLAP and Data Warehouse engines</strong></em>. </p><p>In contrast, Apache Pinot and Apache Druid are struggling to keep pace with the likes of ClickHouse and Doris/StarRocks. The lack of significant innovations from these projects in 2024 might hinder their ability to further capture market interest.</p><h2>Code Activity</h2><p>Development activity in 2024 varied widely across projects. StarRocks and Apache Doris led in pull request activity, each handling approximately 30K pull requests (opened plus closed), while ClickHouse held a strong third position.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nxZH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755a309-76a5-4a4d-b285-ae2fafff24b9_1651x848.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nxZH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755a309-76a5-4a4d-b285-ae2fafff24b9_1651x848.png 424w, https://substackcdn.com/image/fetch/$s_!nxZH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755a309-76a5-4a4d-b285-ae2fafff24b9_1651x848.png 848w, https://substackcdn.com/image/fetch/$s_!nxZH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755a309-76a5-4a4d-b285-ae2fafff24b9_1651x848.png 1272w, https://substackcdn.com/image/fetch/$s_!nxZH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755a309-76a5-4a4d-b285-ae2fafff24b9_1651x848.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!nxZH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755a309-76a5-4a4d-b285-ae2fafff24b9_1651x848.png" width="1456" height="748" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d755a309-76a5-4a4d-b285-ae2fafff24b9_1651x848.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:748,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115344,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nxZH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755a309-76a5-4a4d-b285-ae2fafff24b9_1651x848.png 424w, https://substackcdn.com/image/fetch/$s_!nxZH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755a309-76a5-4a4d-b285-ae2fafff24b9_1651x848.png 848w, https://substackcdn.com/image/fetch/$s_!nxZH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755a309-76a5-4a4d-b285-ae2fafff24b9_1651x848.png 1272w, https://substackcdn.com/image/fetch/$s_!nxZH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755a309-76a5-4a4d-b285-ae2fafff24b9_1651x848.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>However, ClickHouse recorded the highest commit activity, with the most consistent push frequency and merge operations. 
StarRocks and Apache Doris followed with high activity levels, ranking second and third respectively.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ETg4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33693234-d283-4ea0-8962-ae4578062b9e_1985x965.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ETg4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33693234-d283-4ea0-8962-ae4578062b9e_1985x965.png 424w, https://substackcdn.com/image/fetch/$s_!ETg4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33693234-d283-4ea0-8962-ae4578062b9e_1985x965.png 848w, https://substackcdn.com/image/fetch/$s_!ETg4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33693234-d283-4ea0-8962-ae4578062b9e_1985x965.png 1272w, https://substackcdn.com/image/fetch/$s_!ETg4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33693234-d283-4ea0-8962-ae4578062b9e_1985x965.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ETg4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33693234-d283-4ea0-8962-ae4578062b9e_1985x965.png" width="1456" height="708" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33693234-d283-4ea0-8962-ae4578062b9e_1985x965.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:708,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:205849,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ETg4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33693234-d283-4ea0-8962-ae4578062b9e_1985x965.png 424w, https://substackcdn.com/image/fetch/$s_!ETg4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33693234-d283-4ea0-8962-ae4578062b9e_1985x965.png 848w, https://substackcdn.com/image/fetch/$s_!ETg4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33693234-d283-4ea0-8962-ae4578062b9e_1985x965.png 1272w, https://substackcdn.com/image/fetch/$s_!ETg4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33693234-d283-4ea0-8962-ae4578062b9e_1985x965.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In contrast, 2024 saw lower levels of development activity for <strong>Druid</strong>, <strong>Pinot</strong>, and <strong>Kylin</strong>. While Druid and Pinot are mature products that might naturally require less frequent updates, their reduced code activity, along with Kylin's, still raises concerns about potential project stagnation. </p><p>This trend is particularly evident in Apache Kylin, which recorded only 150 pull requests and 85 code pushes for the entire year, indicating significantly diminished development momentum. </p><p>As a result, Apache Kylin will be excluded from the majority of this study moving forward.</p><h2>User Engagement</h2><p>Repository issue activity&#8212;both opened and resolved&#8212;serves as a critical indicator of project health and community engagement in open-source projects. 
</p><p>Analysis of 2024 data shows ClickHouse leading in this metric with the highest volume of user-reported issues and resolutions, with StarRocks and Doris showing strong activity levels in second and third positions respectively.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FRny!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4de4b90f-f70b-404e-8f15-4e5b915bc51c_2236x1183.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FRny!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4de4b90f-f70b-404e-8f15-4e5b915bc51c_2236x1183.png 424w, https://substackcdn.com/image/fetch/$s_!FRny!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4de4b90f-f70b-404e-8f15-4e5b915bc51c_2236x1183.png 848w, https://substackcdn.com/image/fetch/$s_!FRny!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4de4b90f-f70b-404e-8f15-4e5b915bc51c_2236x1183.png 1272w, https://substackcdn.com/image/fetch/$s_!FRny!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4de4b90f-f70b-404e-8f15-4e5b915bc51c_2236x1183.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FRny!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4de4b90f-f70b-404e-8f15-4e5b915bc51c_2236x1183.png" width="1456" height="770" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4de4b90f-f70b-404e-8f15-4e5b915bc51c_2236x1183.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:770,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:170015,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FRny!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4de4b90f-f70b-404e-8f15-4e5b915bc51c_2236x1183.png 424w, https://substackcdn.com/image/fetch/$s_!FRny!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4de4b90f-f70b-404e-8f15-4e5b915bc51c_2236x1183.png 848w, https://substackcdn.com/image/fetch/$s_!FRny!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4de4b90f-f70b-404e-8f15-4e5b915bc51c_2236x1183.png 1272w, https://substackcdn.com/image/fetch/$s_!FRny!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4de4b90f-f70b-404e-8f15-4e5b915bc51c_2236x1183.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Project Collaboration</h2><p>ClickHouse maintains the largest contributor base with over 1,600 committers since inception, yet Apache Doris and StarRocks showed stronger community growth in 2024 by attracting more new contributors.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cn59!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dabce5-2ff7-460f-98a0-568bd446e26b_2180x1098.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cn59!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dabce5-2ff7-460f-98a0-568bd446e26b_2180x1098.png 424w, 
https://substackcdn.com/image/fetch/$s_!cn59!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dabce5-2ff7-460f-98a0-568bd446e26b_2180x1098.png 848w, https://substackcdn.com/image/fetch/$s_!cn59!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dabce5-2ff7-460f-98a0-568bd446e26b_2180x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!cn59!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dabce5-2ff7-460f-98a0-568bd446e26b_2180x1098.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cn59!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dabce5-2ff7-460f-98a0-568bd446e26b_2180x1098.png" width="1456" height="733" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b8dabce5-2ff7-460f-98a0-568bd446e26b_2180x1098.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:733,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:179758,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cn59!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dabce5-2ff7-460f-98a0-568bd446e26b_2180x1098.png 424w, 
https://substackcdn.com/image/fetch/$s_!cn59!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dabce5-2ff7-460f-98a0-568bd446e26b_2180x1098.png 848w, https://substackcdn.com/image/fetch/$s_!cn59!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dabce5-2ff7-460f-98a0-568bd446e26b_2180x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!cn59!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dabce5-2ff7-460f-98a0-568bd446e26b_2180x1098.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The contribution metrics also reveal significant regional concentration, with StarRocks and Doris receiving substantial backing from major Chinese technology companies including <strong>Baidu</strong>, <strong>Tencent</strong>, and <strong>Alibaba</strong>. Over <a href="https://ossinsight.io/analyze/apache/doris?vs=StarRocks%2Fstarrocks#people">80% of contributions</a> to these repositories originate from China, reflecting strong regional investment in these platforms.</p><h2>Adoption &amp; Installations</h2><p>Industry adoption and usage can be measured through download and installation metrics, with Docker Hub downloads serving as a particularly reliable indicator for database systems.</p><p>On this metric, ClickHouse demonstrates exceptional market penetration with over 100 million downloads. </p><p>Apache Pinot and Apache Druid show substantial adoption with 10 million and 5 million downloads respectively. 
The newer entrants, StarRocks and Doris, have achieved encouraging early traction with 500K and 100K downloads respectively.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oa4p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F741e649d-40f9-4169-872b-d9a20cae11e7_1166x912.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oa4p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F741e649d-40f9-4169-872b-d9a20cae11e7_1166x912.png 424w, https://substackcdn.com/image/fetch/$s_!oa4p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F741e649d-40f9-4169-872b-d9a20cae11e7_1166x912.png 848w, https://substackcdn.com/image/fetch/$s_!oa4p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F741e649d-40f9-4169-872b-d9a20cae11e7_1166x912.png 1272w, https://substackcdn.com/image/fetch/$s_!oa4p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F741e649d-40f9-4169-872b-d9a20cae11e7_1166x912.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oa4p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F741e649d-40f9-4169-872b-d9a20cae11e7_1166x912.png" width="1166" height="912" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/741e649d-40f9-4169-872b-d9a20cae11e7_1166x912.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:912,&quot;width&quot;:1166,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93882,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oa4p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F741e649d-40f9-4169-872b-d9a20cae11e7_1166x912.png 424w, https://substackcdn.com/image/fetch/$s_!oa4p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F741e649d-40f9-4169-872b-d9a20cae11e7_1166x912.png 848w, https://substackcdn.com/image/fetch/$s_!oa4p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F741e649d-40f9-4169-872b-d9a20cae11e7_1166x912.png 1272w, https://substackcdn.com/image/fetch/$s_!oa4p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F741e649d-40f9-4169-872b-d9a20cae11e7_1166x912.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Major 2024 Trends</h1><h2>1. 
Adoption of Decoupled Storage and Compute Architecture</h2><p>Distributed real-time OLAP systems have traditionally followed the shared-nothing architecture common to modern MPP-based storage systems. </p><p>However, the industry's shift towards decoupled storage and compute models has prompted real-time OLAP engines to embrace "<a href="https://practicaldataengineering.substack.com/p/zero-disk-architecture-the-future">zero-disk architecture</a>". </p><p>This architecture leverages deep storage solutions like HDFS and S3 as the primary persistence layer, offering enhanced scalability and flexibility while reducing operational costs.</p><p>In 2024, <strong><a href="https://www.starrocks.io/blog/separation-of-storage-and-compute-an-architecture-that-cuts-costs-and-enhances-efficiency">StarRocks</a></strong> and <strong><a href="https://doris.apache.org/blog/release-note-3.0.0/">Apache Doris</a></strong> incorporated this architectural approach into their platforms. </p><p>ClickHouse had earlier laid the groundwork for this transition with its <strong>S3-backed MergeTree</strong> tables in <strong>version 21.8</strong> (August 2021), enabling direct table storage in Amazon S3 or compatible object storage, and has since expanded its cloud offerings around this model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IIkn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3fd92d-23a6-44b2-8dea-5d087293a57f_2977x1699.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IIkn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3fd92d-23a6-44b2-8dea-5d087293a57f_2977x1699.png 424w, 
https://substackcdn.com/image/fetch/$s_!IIkn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3fd92d-23a6-44b2-8dea-5d087293a57f_2977x1699.png 848w, https://substackcdn.com/image/fetch/$s_!IIkn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3fd92d-23a6-44b2-8dea-5d087293a57f_2977x1699.png 1272w, https://substackcdn.com/image/fetch/$s_!IIkn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3fd92d-23a6-44b2-8dea-5d087293a57f_2977x1699.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IIkn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3fd92d-23a6-44b2-8dea-5d087293a57f_2977x1699.png" width="1456" height="831" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b3fd92d-23a6-44b2-8dea-5d087293a57f_2977x1699.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:831,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1790255,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IIkn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3fd92d-23a6-44b2-8dea-5d087293a57f_2977x1699.png 424w, 
https://substackcdn.com/image/fetch/$s_!IIkn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3fd92d-23a6-44b2-8dea-5d087293a57f_2977x1699.png 848w, https://substackcdn.com/image/fetch/$s_!IIkn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3fd92d-23a6-44b2-8dea-5d087293a57f_2977x1699.png 1272w, https://substackcdn.com/image/fetch/$s_!IIkn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b3fd92d-23a6-44b2-8dea-5d087293a57f_2977x1699.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>2. Federated Analytics</h2><p>The rising adoption of data lake and lakehouse architectures has prompted major analytical storage systems to pursue seamless integration with open table formats including <strong>Hudi</strong>, <strong>Iceberg</strong>, and <strong>Delta Lake</strong>. </p><p>Beyond these table formats, the platforms are extending their capabilities to support both read and write operations for industry-standard data lake file formats such as <strong>Parquet</strong> and <strong>ORC</strong>.</p><p>Modern real-time OLAP engines have embraced this trend, evolving beyond their traditional roles into unified analytics platforms. </p><p>ClickHouse, Apache Doris, and StarRocks have implemented native federation capabilities, enabling direct querying across diverse data sources&#8212;including data warehouses, data lakes, and open table formats&#8212;without the traditional requirements of data ingestion or replication.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O_BU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0072cee5-75da-49e4-9218-1ff8e76b876a_1199x906.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O_BU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0072cee5-75da-49e4-9218-1ff8e76b876a_1199x906.png 424w, https://substackcdn.com/image/fetch/$s_!O_BU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0072cee5-75da-49e4-9218-1ff8e76b876a_1199x906.png 848w, 
https://substackcdn.com/image/fetch/$s_!O_BU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0072cee5-75da-49e4-9218-1ff8e76b876a_1199x906.png 1272w, https://substackcdn.com/image/fetch/$s_!O_BU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0072cee5-75da-49e4-9218-1ff8e76b876a_1199x906.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O_BU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0072cee5-75da-49e4-9218-1ff8e76b876a_1199x906.png" width="1199" height="906" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0072cee5-75da-49e4-9218-1ff8e76b876a_1199x906.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:906,&quot;width&quot;:1199,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:372213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O_BU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0072cee5-75da-49e4-9218-1ff8e76b876a_1199x906.png 424w, https://substackcdn.com/image/fetch/$s_!O_BU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0072cee5-75da-49e4-9218-1ff8e76b876a_1199x906.png 848w, 
https://substackcdn.com/image/fetch/$s_!O_BU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0072cee5-75da-49e4-9218-1ff8e76b876a_1199x906.png 1272w, https://substackcdn.com/image/fetch/$s_!O_BU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0072cee5-75da-49e4-9218-1ff8e76b876a_1199x906.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h2>3. 
Real-Time Data Warehouse</h2><p>A significant trend in the ecosystem is the drive towards delivering "<strong>real-time data warehouse</strong>" capabilities.</p><p>This advancement enables <em><strong>ad-hoc analytical queries without pre-aggregation</strong></em>, supporting complex joins across multiple datasets&#8212;a functionality that has traditionally challenged real-time OLAP storage models.</p><p><strong>Doris</strong> and <strong>StarRocks</strong> are leading this transformation by combining the strengths of flexible MPP engines (exemplified by Redshift, Snowflake, and BigQuery) with the real-time analytical capabilities of OLAP systems. Their hybrid approach achieves a balance of speed, flexibility, and scalability.</p><p><strong>ClickHouse</strong> has also embraced this direction, enhancing its core engine to support broader OLAP workloads through improved update operations and enhanced join capabilities.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xft4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094ed457-bc25-4c5d-a6be-ece722045835_1206x1082.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xft4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094ed457-bc25-4c5d-a6be-ece722045835_1206x1082.png 424w, https://substackcdn.com/image/fetch/$s_!Xft4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094ed457-bc25-4c5d-a6be-ece722045835_1206x1082.png 848w, 
https://substackcdn.com/image/fetch/$s_!Xft4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094ed457-bc25-4c5d-a6be-ece722045835_1206x1082.png 1272w, https://substackcdn.com/image/fetch/$s_!Xft4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094ed457-bc25-4c5d-a6be-ece722045835_1206x1082.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xft4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094ed457-bc25-4c5d-a6be-ece722045835_1206x1082.png" width="1206" height="1082" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/094ed457-bc25-4c5d-a6be-ece722045835_1206x1082.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1082,&quot;width&quot;:1206,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1170331,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xft4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094ed457-bc25-4c5d-a6be-ece722045835_1206x1082.png 424w, https://substackcdn.com/image/fetch/$s_!Xft4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094ed457-bc25-4c5d-a6be-ece722045835_1206x1082.png 848w, 
https://substackcdn.com/image/fetch/$s_!Xft4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094ed457-bc25-4c5d-a6be-ece722045835_1206x1082.png 1272w, https://substackcdn.com/image/fetch/$s_!Xft4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F094ed457-bc25-4c5d-a6be-ece722045835_1206x1082.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h1>Major Features and Improvements Introduced in 2024</h1><p>Below is an analysis of major features and enhancements 
introduced by each platform in 2024:</p><p><strong>ClickHouse</strong></p><ul><li><p><strong>Refreshable Materialised Views</strong>: Enables full recomputation and refresh of materialised views, either on a schedule or on demand.</p></li><li><p><strong>Remote File Caching</strong>: Enhances efficiency for distributed workloads by caching remote files, significantly reducing access times.</p></li><li><p><strong>JSON Data Type</strong>: Introduced in ClickHouse v24.8, this new data type improves handling of semi-structured data.</p></li></ul><p><strong>StarRocks</strong></p><ul><li><p><strong>Shared-Data Clusters</strong>: Introduced in version 3.3, providing a decoupled storage and compute architecture for zero-disk capability using deep storage as the primary persistence layer.</p></li><li><p><strong>Pipe Service</strong>: Automates continuous loading of data files (e.g., Parquet) from deep storage services like S3 and HDFS into the StarRocks engine.</p></li><li><p><strong>Unified Catalog</strong>: Enables query federation over lakehouse tables, supporting direct queries on open table formats and integrating with Hive Metastore or AWS Glue for data discovery.</p></li></ul><p><strong>Apache Doris</strong></p><ul><li><p><strong>Decoupled Storage and Compute Architecture</strong>: Introduced in <strong>Doris 3.0.3</strong> (December 2024), supporting S3-compatible object storage for cloud-native data persistence.</p></li><li><p><strong>Data Write-Back</strong>: Enables DDL and DML functions such as creating tables and writing data to Hive and Iceberg tables directly through Doris.</p></li><li><p><strong>Transaction Support</strong>: Adds CRUD operations like <code>INSERT INTO SELECT</code>, <code>DELETE</code>, and <code>UPDATE</code>.</p></li><li><p><strong>Materialised Views</strong>: Adds asynchronous and multi-table materialised views.</p></li><li><p><strong>Semi-Structured Data Support</strong>: Enhanced with the new VARIANT data type.</p></li><li><p>Expanded 
<strong>Data Lake Integration</strong> capabilities.</p></li></ul><p><strong>Apache Druid</strong></p><ul><li><p><strong>Version 30.0 Enhancements</strong>: Better ingestion experiences for Amazon Kinesis, Apache Kafka, and Delta Lake, plus improved integrations with Google Cloud Storage and Azure Blob Storage.</p></li><li><p><strong>Centralised Schema Management</strong>: Speeds up schema operations and cluster startup by gathering segment metadata.</p></li><li><p><strong>Version 31.0 (Dart Query Engine)</strong>: Introduced a new query engine supporting complex workloads like large joins and high-cardinality <code>GROUP BY</code>, expanding Druid's capabilities into MPP (Massively Parallel Processing) territory.</p></li></ul><p><strong>Apache Pinot</strong></p><ul><li><p><strong>Multi-Stage Query Engine</strong>: Enhanced for better performance and scalability.</p></li><li><p><strong>Upsert and Compaction Improvements</strong>: Optimised for data ingestion workflows.</p></li><li><p><strong>Semi-Structured Data Support</strong>: Improved handling of JSON data.</p></li><li><p><strong>Delta Lake Integration</strong>: Added support for the Delta Kernel library, enabling integration with Delta Lake.</p></li></ul><p></p><h2>Major Vendor Announcements and Features</h2><p>The following table lists key developments across SaaS vendors supporting these open-source platforms, including major announcements, strategic partnerships, and product expansions:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D2u4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0dff4c6-f85d-43d5-b99b-58d04b17f12a_1245x938.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!D2u4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0dff4c6-f85d-43d5-b99b-58d04b17f12a_1245x938.png 424w, https://substackcdn.com/image/fetch/$s_!D2u4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0dff4c6-f85d-43d5-b99b-58d04b17f12a_1245x938.png 848w, https://substackcdn.com/image/fetch/$s_!D2u4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0dff4c6-f85d-43d5-b99b-58d04b17f12a_1245x938.png 1272w, https://substackcdn.com/image/fetch/$s_!D2u4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0dff4c6-f85d-43d5-b99b-58d04b17f12a_1245x938.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D2u4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0dff4c6-f85d-43d5-b99b-58d04b17f12a_1245x938.png" width="1245" height="938" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0dff4c6-f85d-43d5-b99b-58d04b17f12a_1245x938.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:938,&quot;width&quot;:1245,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:193703,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!D2u4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0dff4c6-f85d-43d5-b99b-58d04b17f12a_1245x938.png 424w, https://substackcdn.com/image/fetch/$s_!D2u4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0dff4c6-f85d-43d5-b99b-58d04b17f12a_1245x938.png 848w, https://substackcdn.com/image/fetch/$s_!D2u4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0dff4c6-f85d-43d5-b99b-58d04b17f12a_1245x938.png 1272w, https://substackcdn.com/image/fetch/$s_!D2u4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0dff4c6-f85d-43d5-b99b-58d04b17f12a_1245x938.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Features Comparison</h1><p>The real-time OLAP landscape has been extensively documented, with numerous comparisons of the top three contenders&#8212;ClickHouse, Druid, and Pinot. </p><p>Roman Leventov's comprehensive <a href="https://leventov.medium.com/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7">2018 analysis</a> stands as a notable reference point, detailing their architectural differences and capabilities. However, these platforms have evolved significantly in recent years.</p><p>The following sections provide a detailed comparison of these engines across key architectural and functional categories, reflecting their current capabilities and distinctions.</p><h2>1. System Architecture</h2><p>These platforms all build upon a distributed <em><strong>shared-nothing architecture</strong></em>. ClickHouse, Doris, and StarRocks have extended this model to support <em><strong>shared-storage (zero-disk)</strong></em> configurations through decoupled storage and compute capabilities.</p><p>From a software architecture perspective, <strong>ClickHouse, Doris, and StarRocks</strong> adopt a simpler architecture than the relatively complex implementations of <strong>Apache Druid</strong> and <strong>Apache Pinot</strong>. 
</p><p>Doris and StarRocks are particularly notable for their <em><strong>self-contained</strong></em> design, operating independently of external system dependencies.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-LuV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b4e2bc-3789-479c-821c-46021f433ff8_1537x1146.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-LuV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b4e2bc-3789-479c-821c-46021f433ff8_1537x1146.png 424w, https://substackcdn.com/image/fetch/$s_!-LuV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b4e2bc-3789-479c-821c-46021f433ff8_1537x1146.png 848w, https://substackcdn.com/image/fetch/$s_!-LuV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b4e2bc-3789-479c-821c-46021f433ff8_1537x1146.png 1272w, https://substackcdn.com/image/fetch/$s_!-LuV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b4e2bc-3789-479c-821c-46021f433ff8_1537x1146.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-LuV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b4e2bc-3789-479c-821c-46021f433ff8_1537x1146.png" width="1456" height="1086" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35b4e2bc-3789-479c-821c-46021f433ff8_1537x1146.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1086,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:262380,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-LuV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b4e2bc-3789-479c-821c-46021f433ff8_1537x1146.png 424w, https://substackcdn.com/image/fetch/$s_!-LuV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b4e2bc-3789-479c-821c-46021f433ff8_1537x1146.png 848w, https://substackcdn.com/image/fetch/$s_!-LuV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b4e2bc-3789-479c-821c-46021f433ff8_1537x1146.png 1272w, https://substackcdn.com/image/fetch/$s_!-LuV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35b4e2bc-3789-479c-821c-46021f433ff8_1537x1146.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>2. Data Architecture</h2><p>All products implement <em><strong>columnar</strong></em> storage systems, with Doris and StarRocks recently extending their capabilities to support <em><strong>row-based</strong></em> storage modes. </p><p>Druid and Pinot distinguish themselves through their <em><strong>time-series-oriented</strong></em> data model, which leverages timestamp columns as primary partition fields. These two engines also excel in supporting <em><strong>hybrid workloads</strong></em> (real-time and batch) with flexible segmentation granularity.</p><p>Regarding transactional capabilities, Doris and StarRocks lead the ecosystem in ACID compliance, followed closely by ClickHouse. 
These platforms provide good support for primary key uniqueness, atomicity and concurrency control.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ApjU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ea43a4d-fe0a-4f82-a0d0-33f7ac36622a_1563x2234.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ApjU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ea43a4d-fe0a-4f82-a0d0-33f7ac36622a_1563x2234.png 424w, https://substackcdn.com/image/fetch/$s_!ApjU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ea43a4d-fe0a-4f82-a0d0-33f7ac36622a_1563x2234.png 848w, https://substackcdn.com/image/fetch/$s_!ApjU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ea43a4d-fe0a-4f82-a0d0-33f7ac36622a_1563x2234.png 1272w, https://substackcdn.com/image/fetch/$s_!ApjU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ea43a4d-fe0a-4f82-a0d0-33f7ac36622a_1563x2234.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ApjU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ea43a4d-fe0a-4f82-a0d0-33f7ac36622a_1563x2234.png" width="1456" height="2081" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ea43a4d-fe0a-4f82-a0d0-33f7ac36622a_1563x2234.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2081,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:433833,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ApjU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ea43a4d-fe0a-4f82-a0d0-33f7ac36622a_1563x2234.png 424w, https://substackcdn.com/image/fetch/$s_!ApjU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ea43a4d-fe0a-4f82-a0d0-33f7ac36622a_1563x2234.png 848w, https://substackcdn.com/image/fetch/$s_!ApjU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ea43a4d-fe0a-4f82-a0d0-33f7ac36622a_1563x2234.png 1272w, https://substackcdn.com/image/fetch/$s_!ApjU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ea43a4d-fe0a-4f82-a0d0-33f7ac36622a_1563x2234.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>3. Query and Materialised Views</h2><p>Materialised view support is currently limited to ClickHouse, Doris, and StarRocks. </p><p>These engines offer both <em><strong>synchronous</strong></em> <em><strong>materialised views</strong></em> which automatically update when source table data changes, and <em><strong>asynchronous materialised views</strong></em> which can be recomputed on demand or through scheduled jobs.</p><p>Doris and StarRocks extend this functionality with advanced features including <em><strong>multi-table materialised views</strong></em> and automatic query rewriting capabilities. </p><p>They also demonstrate superior complex join support, while ClickHouse provides moderate join capabilities. 
Druid and Pinot, however, are restricted to joins with small, dedicated dimension tables.</p><p>In terms of query processing, ClickHouse, Doris, and StarRocks lead with sophisticated optimisation techniques, including <em><strong>cost-based optimisation (CBO)</strong></em> and <em><strong>vectorised processing</strong></em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DjRP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ae8a0a-26fe-4ac7-84d9-7bf72054b13f_1545x914.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DjRP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ae8a0a-26fe-4ac7-84d9-7bf72054b13f_1545x914.png 424w, https://substackcdn.com/image/fetch/$s_!DjRP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ae8a0a-26fe-4ac7-84d9-7bf72054b13f_1545x914.png 848w, https://substackcdn.com/image/fetch/$s_!DjRP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ae8a0a-26fe-4ac7-84d9-7bf72054b13f_1545x914.png 1272w, https://substackcdn.com/image/fetch/$s_!DjRP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ae8a0a-26fe-4ac7-84d9-7bf72054b13f_1545x914.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DjRP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ae8a0a-26fe-4ac7-84d9-7bf72054b13f_1545x914.png" width="1456" height="861"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08ae8a0a-26fe-4ac7-84d9-7bf72054b13f_1545x914.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:861,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:146558,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DjRP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ae8a0a-26fe-4ac7-84d9-7bf72054b13f_1545x914.png 424w, https://substackcdn.com/image/fetch/$s_!DjRP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ae8a0a-26fe-4ac7-84d9-7bf72054b13f_1545x914.png 848w, https://substackcdn.com/image/fetch/$s_!DjRP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ae8a0a-26fe-4ac7-84d9-7bf72054b13f_1545x914.png 1272w, https://substackcdn.com/image/fetch/$s_!DjRP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ae8a0a-26fe-4ac7-84d9-7bf72054b13f_1545x914.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h2>4. Data Ingestion</h2><p>Real-time OLAP engines support two fundamental data ingestion modes: <strong>batch ingestion</strong> for processing data from sources like data lakes, and <strong>streaming ingestion</strong> for handling continuous data flows from platforms such as Apache Kafka.</p><p>These engines primarily employ <strong>ETL (Extract, Transform, Load)</strong> for data loading, performing transformations during ingestion to align with their immutable data model. </p><p>This approach particularly suits Druid, Pinot, and ClickHouse, which are optimised for <em><strong>denormalised</strong></em> and <em><strong>pre-aggregated</strong></em> data.
The alternative <strong>ELT (Extract, Load, Transform)</strong> approach, which prioritises raw data loading, sees limited adoption in these systems.</p><p>For mutable data operations, Pinot, ClickHouse, Doris, and StarRocks support <em><strong>upserts</strong></em> and primary-key-level <em><strong>row deduplication</strong></em>.</p><h3>Using External Compute Frameworks</h3><p>While offering native batch ingestion capabilities, these platforms also integrate with external processing frameworks, including <strong>Hadoop</strong> (Druid, Pinot), <strong>Spark</strong>, and <strong>Apache Flink</strong>, to offload complex data transformation and computation during ingestion.</p><p>They also support <em><strong>push-based ingestion</strong></em> through frameworks like <strong>Kafka Connect</strong> and <strong>Flink CDC</strong>, which enable data ingestion via custom-built connectors.</p><h3>External Storage Support</h3><p>All platforms effectively handle data lake file formats (<strong>CSV</strong>, <strong>ORC</strong>, <strong>Parquet</strong>), with ClickHouse, Doris, and StarRocks further extending support to major open table formats in <strong>data lakehouse</strong> architectures, a capability where Druid and Pinot currently lag.</p><p>In terms of log-based <strong>Change Data Capture (CDC)</strong> ingestion, ClickHouse and StarRocks offer comprehensive support, though ClickHouse's CDC solutions typically require subscription-based ingestion services.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZyqU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5b4946-5c02-4b84-8eae-0c65e31b828c_1554x2105.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp"
srcset="https://substackcdn.com/image/fetch/$s_!ZyqU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5b4946-5c02-4b84-8eae-0c65e31b828c_1554x2105.png 424w, https://substackcdn.com/image/fetch/$s_!ZyqU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5b4946-5c02-4b84-8eae-0c65e31b828c_1554x2105.png 848w, https://substackcdn.com/image/fetch/$s_!ZyqU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5b4946-5c02-4b84-8eae-0c65e31b828c_1554x2105.png 1272w, https://substackcdn.com/image/fetch/$s_!ZyqU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5b4946-5c02-4b84-8eae-0c65e31b828c_1554x2105.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZyqU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5b4946-5c02-4b84-8eae-0c65e31b828c_1554x2105.png" width="1456" height="1972" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f5b4946-5c02-4b84-8eae-0c65e31b828c_1554x2105.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1972,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:609304,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ZyqU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5b4946-5c02-4b84-8eae-0c65e31b828c_1554x2105.png 424w, https://substackcdn.com/image/fetch/$s_!ZyqU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5b4946-5c02-4b84-8eae-0c65e31b828c_1554x2105.png 848w, https://substackcdn.com/image/fetch/$s_!ZyqU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5b4946-5c02-4b84-8eae-0c65e31b828c_1554x2105.png 1272w, https://substackcdn.com/image/fetch/$s_!ZyqU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5b4946-5c02-4b84-8eae-0c65e31b828c_1554x2105.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h2>5. Query Federation &amp; Interfaces</h2><p>As highlighted earlier, a key advancement in modern OLAP systems is their integration with external storage systems, enabling <strong>direct data querying without prior ingestion</strong>.</p><p>Among the five open-source platforms, ClickHouse, Apache Doris, and StarRocks lead with advanced query federation capabilities. These platforms support major external storage systems, including data lakes, lakehouses, and DBMSs.</p><p>They enable <em><strong>external table</strong></em> definitions over files and directories in data lakes hosted on HDFS and S3-compatible platforms, and facilitate direct querying of open table formats like <strong>Hudi</strong>, <strong>Iceberg</strong>, and <strong>Delta Lake</strong>. Doris and StarRocks extend this support to include <strong>Apache Paimon</strong>.</p><h3>Data Write-back</h3><p>For data export and write-back operations, Doris and StarRocks support writing data to <strong>Hive</strong> and <strong>Iceberg</strong>, while ClickHouse mainly supports <strong>MySQL</strong> and <strong>Postgres</strong>.</p><p>Doris and StarRocks also enhance their data discovery capabilities through integration with external metadata services such as <strong>Hive Metastore</strong> and <strong>AWS Glue</strong>.</p><p>In terms of broader ecosystem integration, ClickHouse, Druid, and Pinot demonstrate comprehensive support. Major compute frameworks, including <strong>Spark</strong>, <strong>Presto</strong>, <strong>Trino</strong>, and <strong>Hive</strong>, have integrated with these engines, enabling direct data access.
</p><p>All platforms provide moderate support for BI tool integration.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ByGy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ca139c-4072-45b7-9f26-fddf0ae43307_1546x1735.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ByGy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ca139c-4072-45b7-9f26-fddf0ae43307_1546x1735.png 424w, https://substackcdn.com/image/fetch/$s_!ByGy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ca139c-4072-45b7-9f26-fddf0ae43307_1546x1735.png 848w, https://substackcdn.com/image/fetch/$s_!ByGy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ca139c-4072-45b7-9f26-fddf0ae43307_1546x1735.png 1272w, https://substackcdn.com/image/fetch/$s_!ByGy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ca139c-4072-45b7-9f26-fddf0ae43307_1546x1735.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ByGy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ca139c-4072-45b7-9f26-fddf0ae43307_1546x1735.png" width="1456" height="1634" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7ca139c-4072-45b7-9f26-fddf0ae43307_1546x1735.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1634,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:476917,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ByGy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ca139c-4072-45b7-9f26-fddf0ae43307_1546x1735.png 424w, https://substackcdn.com/image/fetch/$s_!ByGy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ca139c-4072-45b7-9f26-fddf0ae43307_1546x1735.png 848w, https://substackcdn.com/image/fetch/$s_!ByGy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ca139c-4072-45b7-9f26-fddf0ae43307_1546x1735.png 1272w, https://substackcdn.com/image/fetch/$s_!ByGy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ca139c-4072-45b7-9f26-fddf0ae43307_1546x1735.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h1>Recommendations</h1><p>Each product in the current OLAP market offers unique advantages derived from its original design goals and subsequent feature developments.
</p><p>While this general recommendation guide serves as a starting point, a comprehensive evaluation across multiple criteria is essential when comparing and selecting a product.</p><p><strong>Small-to-Medium Deployments:</strong></p><p>Overall, <strong>ClickHouse</strong> is an excellent real-time OLAP engine for small-to-medium environments. Its straightforward deployment, management, and architecture make it the preferred choice for general use cases.</p><p><strong>Large On-Premises Deployments:</strong></p><p>For large-scale implementations, particularly on Hadoop or similar platforms, <strong>ClickHouse</strong>, <strong>Pinot</strong>, and <strong>Druid</strong> are leading candidates. The final selection should align with specific workload requirements and use cases.</p><p><strong>Cloud-Native Implementations:</strong></p><p>Cloud-native deployments that use object storage as the main persistence layer can leverage managed solutions like <strong>ClickHouse Cloud</strong>, or platforms such as <strong>StarRocks</strong> and <strong>Doris</strong>. However, StarRocks and Doris introduced their decoupled architectures only recently, so their production readiness should be evaluated carefully.</p><p><strong>Log Analytics &amp; Time-Series Data:</strong></p><p><strong>Druid</strong> and <strong>Pinot</strong> demonstrate particular strength in processing immutable time-series data, including web logs, machine logs, and clickstream events.
Their support for hybrid tables makes them ideal for Lambda-style architectures.</p><p><strong>Unified Analytics with Query Federation:</strong></p><p><strong>ClickHouse</strong>, <strong>StarRocks</strong>, and <strong>Doris</strong> excel in unified analytics scenarios, offering query federation capabilities that enable seamless data access across diverse sources such as data lakes, lakehouses, and DBMSs.</p><p><strong>Hybrid Data Warehouse-OLAP Solutions:</strong></p><p><strong>StarRocks</strong> and <strong>Doris</strong> provide a middle ground, combining traditional data warehouse capabilities with real-time OLAP performance. They offer comprehensive CRUD operations, complex join support (including star schemas), and ACID guarantees to some extent.</p><h1>Conclusion</h1><p>The real-time OLAP ecosystem has evolved from specialised engines for append-only data processing into versatile analytical platforms. While ClickHouse maintains its leadership position in general-purpose deployments, newer platforms like StarRocks and Apache Doris are bridging the gap between real-time analytics and data warehouse capabilities.</p><p>The adoption of decoupled architectures and unified analytics suggests continuing evolution in this space.
Organisations should evaluate their specific performance, scalability, and integration requirements, and weigh platform maturity and community support, when selecting among these open-source systems.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DS7n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" width="220" height="218.42293906810036"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:558,&quot;resizeWidth&quot;:220,&quot;bytes&quot;:136646,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/in/alirezasadeghi/&quot;,&quot;text&quot;:&quot;Follow Me on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/in/alirezasadeghi/"><span>Follow Me on 
LinkedIn</span></a></p>]]></content:encoded></item><item><title><![CDATA[Zero-Disk Architecture: The Future of Cloud Storage Systems]]></title><description><![CDATA[Data Landscape Trends: 2024-2025 Series]]></description><link>https://www.pracdata.io/p/zero-disk-architecture-the-future</link><guid isPermaLink="false">https://www.pracdata.io/p/zero-disk-architecture-the-future</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Thu, 16 Jan 2025 10:33:31 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6f49ca93-5e08-4399-9414-428d78fbd4e8_1662x1563.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3yE3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e0fa8cf-070d-4def-9bf1-07c9a4e2361d_2672x1889.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3yE3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e0fa8cf-070d-4def-9bf1-07c9a4e2361d_2672x1889.png 424w, https://substackcdn.com/image/fetch/$s_!3yE3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e0fa8cf-070d-4def-9bf1-07c9a4e2361d_2672x1889.png 848w, https://substackcdn.com/image/fetch/$s_!3yE3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e0fa8cf-070d-4def-9bf1-07c9a4e2361d_2672x1889.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3yE3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e0fa8cf-070d-4def-9bf1-07c9a4e2361d_2672x1889.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3yE3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e0fa8cf-070d-4def-9bf1-07c9a4e2361d_2672x1889.png" width="728" height="514.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e0fa8cf-070d-4def-9bf1-07c9a4e2361d_2672x1889.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1029,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:1436108,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3yE3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e0fa8cf-070d-4def-9bf1-07c9a4e2361d_2672x1889.png 424w, https://substackcdn.com/image/fetch/$s_!3yE3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e0fa8cf-070d-4def-9bf1-07c9a4e2361d_2672x1889.png 848w, https://substackcdn.com/image/fetch/$s_!3yE3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e0fa8cf-070d-4def-9bf1-07c9a4e2361d_2672x1889.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3yE3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e0fa8cf-070d-4def-9bf1-07c9a4e2361d_2672x1889.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>This is the third part in the <strong>Data Landscape Trends 2024-2025</strong> series, focusing on the evolution of <strong>zero-disk architecture</strong>.</p><p>In the <strong><a href="https://practicaldataengineering.substack.com/p/the-evolution-of-business-intelligence-stack">first part</a></strong>, we explored the <strong>evolution of the BI
stack</strong>, while the <strong><a href="https://practicaldataengineering.substack.com/p/the-rise-of-single-node-processing">second part</a></strong> examined the <strong>rise of single-node processing engines</strong>.</p><h1>Introduction</h1><p>The storage systems landscape is undergoing a significant transformation. One major trend in this evolution is the emergence of cloud-native and "<strong>zero-disk</strong>" architectures for storage systems.</p><p>This shift represents a move away from traditional storage systems that rely on locally attached storage devices, such as HDDs, SSDs, or EBS volumes attached in the cloud, towards designs that use remote, scalable object storage services as their primary persistence layer.</p><p>Understanding this transformation is crucial for data-intensive businesses as they navigate the future of data infrastructure, particularly as storage costs often account for a significant portion of total cloud infrastructure expenses.</p><p>In this article, I will explore the evolution of zero-disk architecture, studying:</p><ul><li><p>The historical context and limitations of traditional storage architectures.</p></li><li><p>The emergence of disaggregated storage and its evolution in cloud environments.</p></li><li><p>Economic drivers and technical trade-offs of zero-disk implementations.</p></li><li><p>Current implementation patterns across different use cases.</p></li><li><p>Future directions and emerging solutions.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lh2a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf727427-18b0-41b6-b5e1-3cb1adf73d9e_2494x2345.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp"
srcset="https://substackcdn.com/image/fetch/$s_!Lh2a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf727427-18b0-41b6-b5e1-3cb1adf73d9e_2494x2345.png 424w, https://substackcdn.com/image/fetch/$s_!Lh2a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf727427-18b0-41b6-b5e1-3cb1adf73d9e_2494x2345.png 848w, https://substackcdn.com/image/fetch/$s_!Lh2a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf727427-18b0-41b6-b5e1-3cb1adf73d9e_2494x2345.png 1272w, https://substackcdn.com/image/fetch/$s_!Lh2a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf727427-18b0-41b6-b5e1-3cb1adf73d9e_2494x2345.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lh2a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf727427-18b0-41b6-b5e1-3cb1adf73d9e_2494x2345.png" width="1456" height="1369" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf727427-18b0-41b6-b5e1-3cb1adf73d9e_2494x2345.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1369,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4441668,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Lh2a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf727427-18b0-41b6-b5e1-3cb1adf73d9e_2494x2345.png 424w, https://substackcdn.com/image/fetch/$s_!Lh2a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf727427-18b0-41b6-b5e1-3cb1adf73d9e_2494x2345.png 848w, https://substackcdn.com/image/fetch/$s_!Lh2a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf727427-18b0-41b6-b5e1-3cb1adf73d9e_2494x2345.png 1272w, https://substackcdn.com/image/fetch/$s_!Lh2a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf727427-18b0-41b6-b5e1-3cb1adf73d9e_2494x2345.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h1>Evolution of Storage Architecture</h1><p></p><h2>Traditional Approaches</h2><p>Before disaggregated storage systems appeared, and alongside popular single-node disk-based database systems like MySQL and PostgreSQL, two distributed storage architectures dominated: <strong>shared-nothing architecture</strong> and <strong>shared-disk architecture</strong>. Each approach has its own advantages and challenges, which we explore below.</p><h3>Shared-Disk Architecture</h3><p>Early distributed storage systems were built with a shared-disk architecture, in which multiple nodes with independent compute resources (CPU and RAM) share access to a common storage system over a network.</p><p>While shared-disk architecture simplifies data management, it has significant challenges. Chief among them is I/O contention: because all nodes in the cluster share one I/O subsystem, it can become a critical bottleneck during large I/O-intensive ETL workloads.</p><p>Furthermore, shared-disk systems often necessitate specialised hardware and networking setups, such as Network Attached Storage (NAS) or Storage Area Networks (SAN), which add to the overall cost and complexity.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VSJ1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461e30e8-6eba-40f5-a50e-f0ba88ebde1c_888x1059.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VSJ1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461e30e8-6eba-40f5-a50e-f0ba88ebde1c_888x1059.png 424w, https://substackcdn.com/image/fetch/$s_!VSJ1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461e30e8-6eba-40f5-a50e-f0ba88ebde1c_888x1059.png 848w, https://substackcdn.com/image/fetch/$s_!VSJ1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461e30e8-6eba-40f5-a50e-f0ba88ebde1c_888x1059.png 1272w, https://substackcdn.com/image/fetch/$s_!VSJ1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461e30e8-6eba-40f5-a50e-f0ba88ebde1c_888x1059.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VSJ1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461e30e8-6eba-40f5-a50e-f0ba88ebde1c_888x1059.png" width="888" height="1059" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/461e30e8-6eba-40f5-a50e-f0ba88ebde1c_888x1059.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1059,&quot;width&quot;:888,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:740449,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!VSJ1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461e30e8-6eba-40f5-a50e-f0ba88ebde1c_888x1059.png 424w, https://substackcdn.com/image/fetch/$s_!VSJ1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461e30e8-6eba-40f5-a50e-f0ba88ebde1c_888x1059.png 848w, https://substackcdn.com/image/fetch/$s_!VSJ1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461e30e8-6eba-40f5-a50e-f0ba88ebde1c_888x1059.png 1272w, https://substackcdn.com/image/fetch/$s_!VSJ1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F461e30e8-6eba-40f5-a50e-f0ba88ebde1c_888x1059.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Shared-Disk Architecture</figcaption></figure></div><p></p><p>This design was frequently employed in traditional clustered database systems, such as <strong>Oracle RAC</strong>, to distribute workloads across multiple servers.</p><h3>Shared-Nothing Architecture</h3><p>The shared-nothing architecture, notably outlined by <strong>Michael Stonebraker</strong> in a <a href="https://dsf.berkeley.edu/papers/hpts85-nothing.pdf">paper</a> published in 1986, takes a different approach.</p><p>It distributes the system across multiple nodes, where each node operates independently with its own CPU, memory, and disk. Unlike shared-disk systems, nodes in a shared-nothing setup do not share resources, making this architecture highly scalable and efficient for distributed workloads.</p><p>One of the key advantages of shared-nothing architecture is its <strong>near-linear scalability</strong>. Systems can grow by adding more nodes, enabling them to handle increased workloads effectively. Additionally, shared-nothing systems are cost-effective, as they rely on commodity hardware rather than expensive, specialised equipment.</p><p>However, shared-nothing systems are not without challenges. Managing data distribution is a key hurdle, as hotspots and data skew require careful partitioning and load balancing to ensure optimal performance. 
Similarly, auto-scaling can be complex and costly, often requiring disruptive operations such as repartitioning and leader-follower replication.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RmQz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed781569-9dd5-4f7e-8712-70a3ed908c15_1341x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RmQz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed781569-9dd5-4f7e-8712-70a3ed908c15_1341x934.png 424w, https://substackcdn.com/image/fetch/$s_!RmQz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed781569-9dd5-4f7e-8712-70a3ed908c15_1341x934.png 848w, https://substackcdn.com/image/fetch/$s_!RmQz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed781569-9dd5-4f7e-8712-70a3ed908c15_1341x934.png 1272w, https://substackcdn.com/image/fetch/$s_!RmQz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed781569-9dd5-4f7e-8712-70a3ed908c15_1341x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RmQz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed781569-9dd5-4f7e-8712-70a3ed908c15_1341x934.png" width="1341" height="934" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed781569-9dd5-4f7e-8712-70a3ed908c15_1341x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:934,&quot;width&quot;:1341,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:849976,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RmQz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed781569-9dd5-4f7e-8712-70a3ed908c15_1341x934.png 424w, https://substackcdn.com/image/fetch/$s_!RmQz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed781569-9dd5-4f7e-8712-70a3ed908c15_1341x934.png 848w, https://substackcdn.com/image/fetch/$s_!RmQz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed781569-9dd5-4f7e-8712-70a3ed908c15_1341x934.png 1272w, https://substackcdn.com/image/fetch/$s_!RmQz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed781569-9dd5-4f7e-8712-70a3ed908c15_1341x934.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Shared-Nothing Architecture</figcaption></figure></div><p></p><p>Despite these challenges, shared-nothing architectures have become the backbone of modern disk-based distributed systems like <strong>Apache Kafka</strong>, <strong>Apache Cassandra</strong> and <strong>DynamoDB</strong>. 
They are also widely used in clustered data warehouses like <strong>Teradata</strong> and the initial design of <strong>Amazon Redshift</strong>.</p><p></p><h2>The Emergence of Disaggregated Storage</h2><p>The journey towards zero-disk architectures began when the concept of disaggregated storage and compute gained traction with the rise of <strong>Hadoop</strong> and its associated ecosystem in the mid-2000s.</p><p>The <strong>Hadoop Distributed File System (HDFS)</strong> provided a novel approach to data storage, using commodity hardware to create a highly scalable distributed storage layer with built-in redundancy and fault tolerance.</p><p>This architecture laid the foundation for systems that decouple distributed storage from compute, enabling scalability and flexibility. Distributed compute frameworks like <strong>MapReduce</strong> and <strong>Spark</strong> operate independently of storage units, processing data stored in HDFS via high-throughput networks.</p><p>This architectural pattern evolved into a formalised data architecture, giving rise to modern data lake and, more recently, data lakehouse systems. 
These systems featured fully decoupled storage and compute layers, contrasting with the tightly integrated approach of traditional monolithic database systems:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mmmO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd188db-811d-4ccc-a735-0a12b9998bc4_1628x767.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mmmO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd188db-811d-4ccc-a735-0a12b9998bc4_1628x767.png 424w, https://substackcdn.com/image/fetch/$s_!mmmO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd188db-811d-4ccc-a735-0a12b9998bc4_1628x767.png 848w, https://substackcdn.com/image/fetch/$s_!mmmO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd188db-811d-4ccc-a735-0a12b9998bc4_1628x767.png 1272w, https://substackcdn.com/image/fetch/$s_!mmmO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd188db-811d-4ccc-a735-0a12b9998bc4_1628x767.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mmmO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd188db-811d-4ccc-a735-0a12b9998bc4_1628x767.png" width="1456" height="686" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bd188db-811d-4ccc-a735-0a12b9998bc4_1628x767.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1371603,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mmmO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd188db-811d-4ccc-a735-0a12b9998bc4_1628x767.png 424w, https://substackcdn.com/image/fetch/$s_!mmmO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd188db-811d-4ccc-a735-0a12b9998bc4_1628x767.png 848w, https://substackcdn.com/image/fetch/$s_!mmmO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd188db-811d-4ccc-a735-0a12b9998bc4_1628x767.png 1272w, https://substackcdn.com/image/fetch/$s_!mmmO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd188db-811d-4ccc-a735-0a12b9998bc4_1628x767.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From Monolithic to Disaggregated Storage &amp; Compute Architecture</figcaption></figure></div><p></p>
<h3>Early Zero-Disk Systems</h3><p>A new class of storage systems, including <strong>HBase</strong>, <strong>Solr</strong>, <strong>Hive</strong>, and <strong>Ignite</strong>, emerged within the Hadoop ecosystem using this paradigm, leveraging HDFS as their primary storage abstraction.</p><p>This paradigm introduced a new database architecture pattern:</p><div class="pullquote"><p>Beyond the key advantage of decoupling compute and storage, which allows each to scale independently, these disaggregated storage systems could offload complex low-level storage tasks, such as disk management, data replication, and durability, to a dedicated deep storage service.</p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GLy6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7078f481-2c82-40be-80a1-2205f57769a1_1663x1298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GLy6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7078f481-2c82-40be-80a1-2205f57769a1_1663x1298.png 424w, https://substackcdn.com/image/fetch/$s_!GLy6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7078f481-2c82-40be-80a1-2205f57769a1_1663x1298.png 848w, 
https://substackcdn.com/image/fetch/$s_!GLy6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7078f481-2c82-40be-80a1-2205f57769a1_1663x1298.png 1272w, https://substackcdn.com/image/fetch/$s_!GLy6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7078f481-2c82-40be-80a1-2205f57769a1_1663x1298.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GLy6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7078f481-2c82-40be-80a1-2205f57769a1_1663x1298.png" width="1456" height="1136" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7078f481-2c82-40be-80a1-2205f57769a1_1663x1298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1136,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2370188,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GLy6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7078f481-2c82-40be-80a1-2205f57769a1_1663x1298.png 424w, https://substackcdn.com/image/fetch/$s_!GLy6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7078f481-2c82-40be-80a1-2205f57769a1_1663x1298.png 848w, 
https://substackcdn.com/image/fetch/$s_!GLy6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7078f481-2c82-40be-80a1-2205f57769a1_1663x1298.png 1272w, https://substackcdn.com/image/fetch/$s_!GLy6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7078f481-2c82-40be-80a1-2205f57769a1_1663x1298.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Hadoop-based Zero-Disk Storage Systems</figcaption></figure></div><p></p><p>Furthermore, this approach acknowledges 
that storage and compute requirements often grow at different rates. In practice, storage needs tend to increase rapidly as organisations collect more data, while compute demands may remain relatively stable, especially if the nature of data analysis remains unchanged.</p><p></p><h2>The Rise of Cloud Storage</h2><p>Hadoop-based decoupled storage systems revolutionised data infrastructure by separating compute and storage layers. However, managing these systems introduced significant challenges due to the operational complexity of running and maintaining large Hadoop clusters in data centres.</p><p>Organisations needed to hire highly sought-after Hadoop experts, who were scarce and costly. While large tech companies like Yahoo, LinkedIn, and Twitter could afford this, smaller organisations and non-tech companies often struggled to scale and operate Hadoop effectively.</p><div><hr></div><p><em>In a petabyte-scale on-premises Hadoop platform, we face operational challenges as part of routine activities. These include managing regular disk failures and replacements, rebalancing data across nodes when nodes are added or removed, and addressing performance issues caused by slow data nodes as disk utilisation approaches critical thresholds (typically around 80%). Additionally, maintaining and scaling the NameNode master services as workloads and data volumes grow is a critical task.</em></p><div><hr></div><h3>Emergence of Amazon S3 Storage</h3><p>The emergence of Amazon S3 and the growing momentum of cloud adoption in the 2010s provided a transformative alternative.
</p><p>Cloud object storage services like <strong>Amazon S3</strong> offered a simpler, more scalable, and cost-effective solution for building big data applications, gradually replacing HDFS in many scenarios.</p><p>In addition to eliminating the operational complexities of managing and scaling HDFS, cloud storage offers several key advantages:</p><ul><li><p><strong>High Availability</strong>: Cloud object storage services are typically designed for up to 99.99% availability, whereas on-premises HDFS systems often experience regular downtime due to maintenance, crashes, and hardware or software upgrades.</p></li><li><p><strong>Elasticity</strong>: Cloud storage removes the need for advance capacity planning, which often results in inaccurate resource estimates, and eliminates upfront hardware procurement costs.</p></li><li><p><strong>Unlimited Scalability</strong>: Cloud storage provides virtually unlimited capacity, so growth does not depend on procuring and installing more hardware.</p></li><li><p><strong>Multi-Tier Storage</strong>: Cloud storage offers different storage classes with seamless data migration between them, enabling cost optimisation.</p></li><li><p><strong>Cost Efficiency</strong>: The total cost of ownership for cloud storage, such as Amazon S3, is generally lower than on-premises storage like HDFS, which by default stores three replicas of every block and so incurs a 3x storage overhead.</p></li><li><p><strong>True Zero-Disk Architecture</strong>: Cloud storage supports running multiple permanent or ephemeral compute clusters on a shared storage infrastructure, enabling a truly zero-disk architecture.</p></li></ul><p>The success of decoupled storage and compute architectures in Hadoop platforms, together with the benefits of the new cloud storage, inspired the industry to replicate the same architecture in the cloud.</p><p>Instead of relying on HDFS, new systems would leverage S3 as the primary storage backend, paired with cloud compute for processing power. 
This innovation gave birth to what we now recognise as cloud-native architectures.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J9RL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7bc70d-aec1-41bf-a786-01813ff767bb_1980x973.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J9RL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7bc70d-aec1-41bf-a786-01813ff767bb_1980x973.png 424w, https://substackcdn.com/image/fetch/$s_!J9RL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7bc70d-aec1-41bf-a786-01813ff767bb_1980x973.png 848w, https://substackcdn.com/image/fetch/$s_!J9RL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7bc70d-aec1-41bf-a786-01813ff767bb_1980x973.png 1272w, https://substackcdn.com/image/fetch/$s_!J9RL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7bc70d-aec1-41bf-a786-01813ff767bb_1980x973.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J9RL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7bc70d-aec1-41bf-a786-01813ff767bb_1980x973.png" width="1456" height="715" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f7bc70d-aec1-41bf-a786-01813ff767bb_1980x973.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:715,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1905422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!J9RL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7bc70d-aec1-41bf-a786-01813ff767bb_1980x973.png 424w, https://substackcdn.com/image/fetch/$s_!J9RL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7bc70d-aec1-41bf-a786-01813ff767bb_1980x973.png 848w, https://substackcdn.com/image/fetch/$s_!J9RL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7bc70d-aec1-41bf-a786-01813ff767bb_1980x973.png 1272w, https://substackcdn.com/image/fetch/$s_!J9RL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7bc70d-aec1-41bf-a786-01813ff767bb_1980x973.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Evolution of Zero-Disk Architecture</figcaption></figure></div><p></p><p><strong>Netflix</strong> was among the first adopters to replace HDFS with Amazon S3 on its Hadoop clusters running on AWS in the early 2010s. They utilised S3 as the central data lake storage, while Hive and Spark served as the compute engines operating on Amazon EMR clusters.</p><p></p><h3>The Rise of Cloud-Native Data Vendors</h3><p>The launch of two major cloud data vendors, <strong>Databricks</strong> in 2013 and <strong>Snowflake</strong> in 2014, both embracing the decoupled compute and storage model with cost-effective cloud storage as the foundation, signalled further solidification of the new storage paradigm.</p><p>Snowflake, in particular, emerged as an early pioneer in implementing a fully decoupled, commercial cloud-native data warehouse. 
Their <strong><a href="https://dl.acm.org/doi/10.1145/2882903.2903741">influential paper</a></strong>, published in 2016, introduced key innovations that redefined cloud-based data warehousing, demonstrating how object storage systems like Amazon S3 could serve as the primary storage layer for durability and scalability.</p><p>The paper also introduced a novel multi-cluster shared data architecture, enabling diverse workloads&#8212;analytics, reporting, and ETL&#8212;to run as separate compute clusters on a single shared data platform.</p><p>This approach, known as "<em><strong>virtual warehouses</strong></em>" in Snowflake's terminology, proved highly influential, inspiring other vendors like Amazon Redshift to adopt a similar design.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NhoD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096d6d0b-bd46-4d2b-a0b4-804c4a9ade25_1473x813.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NhoD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096d6d0b-bd46-4d2b-a0b4-804c4a9ade25_1473x813.png 424w, https://substackcdn.com/image/fetch/$s_!NhoD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096d6d0b-bd46-4d2b-a0b4-804c4a9ade25_1473x813.png 848w, https://substackcdn.com/image/fetch/$s_!NhoD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096d6d0b-bd46-4d2b-a0b4-804c4a9ade25_1473x813.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NhoD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096d6d0b-bd46-4d2b-a0b4-804c4a9ade25_1473x813.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NhoD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096d6d0b-bd46-4d2b-a0b4-804c4a9ade25_1473x813.png" width="1456" height="804" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/096d6d0b-bd46-4d2b-a0b4-804c4a9ade25_1473x813.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1271814,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NhoD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096d6d0b-bd46-4d2b-a0b4-804c4a9ade25_1473x813.png 424w, https://substackcdn.com/image/fetch/$s_!NhoD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096d6d0b-bd46-4d2b-a0b4-804c4a9ade25_1473x813.png 848w, https://substackcdn.com/image/fetch/$s_!NhoD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096d6d0b-bd46-4d2b-a0b4-804c4a9ade25_1473x813.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NhoD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F096d6d0b-bd46-4d2b-a0b4-804c4a9ade25_1473x813.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Multi-Cluster Shared Storage Architecture</figcaption></figure></div><p></p><h1>The Zero-Disk Paradigm</h1><p>The rise of cloud-native big data processing, coupled with the maturation of cloud object storage as a universal storage backbone, has driven a shift towards a true 'zero-disk' architecture.</p><div class="pullquote"><p>In a full 
zero-disk architecture, compute units or workers operate as <strong>stateless entities</strong> without attached storage. Persistent data is stored in <strong>cloud object storage</strong>, which offers high <strong>durability</strong>, <strong>availability</strong>, and <strong>scalability</strong> out-of-the-box. This decoupling simplifies auto scaling as compute demand fluctuates, enabling virtually <strong>infinite scalability</strong> for both storage and compute.</p></div><p>Unlike disk-based distributed systems that rely on shared-nothing architectures&#8212;requiring complex operations like load balancing and partition reassignment&#8212;zero-disk systems eliminate such overhead.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C6RG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b17c95f-af7b-4c7a-b639-301094031846_1370x615.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C6RG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b17c95f-af7b-4c7a-b639-301094031846_1370x615.png 424w, https://substackcdn.com/image/fetch/$s_!C6RG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b17c95f-af7b-4c7a-b639-301094031846_1370x615.png 848w, https://substackcdn.com/image/fetch/$s_!C6RG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b17c95f-af7b-4c7a-b639-301094031846_1370x615.png 1272w, 
https://substackcdn.com/image/fetch/$s_!C6RG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b17c95f-af7b-4c7a-b639-301094031846_1370x615.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C6RG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b17c95f-af7b-4c7a-b639-301094031846_1370x615.png" width="1370" height="615" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b17c95f-af7b-4c7a-b639-301094031846_1370x615.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:615,&quot;width&quot;:1370,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:590280,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C6RG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b17c95f-af7b-4c7a-b639-301094031846_1370x615.png 424w, https://substackcdn.com/image/fetch/$s_!C6RG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b17c95f-af7b-4c7a-b639-301094031846_1370x615.png 848w, https://substackcdn.com/image/fetch/$s_!C6RG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b17c95f-af7b-4c7a-b639-301094031846_1370x615.png 1272w, 
https://substackcdn.com/image/fetch/$s_!C6RG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b17c95f-af7b-4c7a-b639-301094031846_1370x615.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Infinite Scalability in Zero-Disk Architecture</figcaption></figure></div><p></p><p>Before exploring the current landscape of zero-disk database systems further, let's examine some of the key trade-offs between disk-based and cloud-native zero-disk architectures.</p><h2>The Economics of Zero-Disk Architecture</h2><p>A key driver for adopting 
zero-disk architecture is the potential for substantial cost reduction.</p><p>From a storage cost perspective, local SSDs attached to VMs and even durable EBS volumes are significantly more expensive than cloud object storage. For instance, storing three replicas on attached storage can cost <strong><a href="https://www.warpstream.com/blog/kafka-is-dead-long-live-kafka">10-20x more per GB</a></strong> than using cloud object storage.</p><p>Beyond the cost per GB, disk-based storage systems face additional expenses from <strong>cross-availability zone (AZ) data transfer fees</strong> when deployed in high availability mode with replication across multiple AZs. These inter-zone networking costs can significantly inflate operating costs.</p><p>As <strong><a href="https://www.warpstream.com/blog/kafka-is-dead-long-live-kafka">claimed</a></strong> by <strong>WarpStream</strong>, a cloud-native alternative to Kafka, and acknowledged by <strong><a href="https://www.warpstream.com/blog/warpstream-benchmarks-and-tco">Redpanda</a></strong> and <strong><a href="https://www.confluent.io/blog/introducing-confluent-cloud-freight-clusters">Confluent</a></strong>, infrastructure costs for running such systems in the cloud can account for <strong>70% to 90%</strong> of total infrastructure and workload expenses. 
This is largely due to inter-zone networking overhead for replicating data between Availability Zones.</p><p>In contrast, distributed zero-disk systems like <strong>WarpStream</strong> leverage object storage (e.g., Amazon S3) as the backbone, claiming to reduce total cost of ownership (TCO) by 5-10x compared to Kafka.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wvqE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcd98d5-e6df-4671-908e-de74b9a6ee9b_1600x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wvqE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcd98d5-e6df-4671-908e-de74b9a6ee9b_1600x934.png 424w, https://substackcdn.com/image/fetch/$s_!wvqE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcd98d5-e6df-4671-908e-de74b9a6ee9b_1600x934.png 848w, https://substackcdn.com/image/fetch/$s_!wvqE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcd98d5-e6df-4671-908e-de74b9a6ee9b_1600x934.png 1272w, https://substackcdn.com/image/fetch/$s_!wvqE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcd98d5-e6df-4671-908e-de74b9a6ee9b_1600x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wvqE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcd98d5-e6df-4671-908e-de74b9a6ee9b_1600x934.png" width="1456" height="850" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fcd98d5-e6df-4671-908e-de74b9a6ee9b_1600x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;cost-breakdown-with-tiered-storage&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="cost-breakdown-with-tiered-storage" title="cost-breakdown-with-tiered-storage" srcset="https://substackcdn.com/image/fetch/$s_!wvqE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcd98d5-e6df-4671-908e-de74b9a6ee9b_1600x934.png 424w, https://substackcdn.com/image/fetch/$s_!wvqE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcd98d5-e6df-4671-908e-de74b9a6ee9b_1600x934.png 848w, https://substackcdn.com/image/fetch/$s_!wvqE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcd98d5-e6df-4671-908e-de74b9a6ee9b_1600x934.png 1272w, https://substackcdn.com/image/fetch/$s_!wvqE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcd98d5-e6df-4671-908e-de74b9a6ee9b_1600x934.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source <a href="https://www.confluent.io/blog/introducing-confluent-cloud-freight-clusters/">link</a></figcaption></figure></div><p>For example, WarpStream <strong><a href="https://www.warpstream.com/blog/minimizing-s3-api-costs-with-distributed-mmap">reports</a></strong> that it can eliminate $641/day in inter-zone networking fees in a typical Kafka production deployment, replacing them with less than $40/day in object storage API costs. Similarly, Redpanda <strong><a href="https://www.redpanda.com/blog/cloud-topics-streaming-data-object-storage">acknowledges</a></strong> the difficulty of escaping such hidden costs in traditional replicated systems.</p><h2>Technical Trade-offs: Disk-Based vs. 
Zero-Disk</h2><p>The fundamental trade-off in zero-disk architecture can be summarised as: <strong>"</strong><em>relaxing latency requirements to leverage cloud durability and economies of scale</em><strong>".</strong></p><p>This trade-off manifests in several key areas:</p><h4>1. Performance Characteristics</h4><p>Traditional local disk-based storage offers fast reads and writes and high IOPS, but often lacks inherent scalability and redundancy. To achieve durability and protection against data loss, disk-based systems require complex setups such as RAID and software-level replication for fault tolerance.</p><p>Zero-disk storage systems, on the other hand, leverage the inherent redundancy and fault tolerance of object stores. Deep object storage systems provide safety guarantees, fault tolerance, and versioning out-of-the-box.</p><p>This simplifies system design, removing the requirement to manage data replication, backups, and snapshots to protect against data loss or disasters. However, the price for these benefits is higher access latency, largely due to the inherent design of object storage and network I/O.</p><h4>2. Directory and File Listing Bottleneck</h4><p>Listing operations for directories and files have long been a significant performance bottleneck in distributed filesystems and cloud object stores. This challenge particularly affects large-scale data systems that need to manage extensive collections of files and partitions.</p><p>In query execution, the planning phase can sometimes take longer than the actual processing, especially when dealing with heavily partitioned datasets and a large number of small data files.</p><p>Those who have worked with query engines like Hive operating on large HDFS clusters may have seen queries that require several minutes just to complete bulk listing operations and gather the required metadata for all relevant files and partitions before processing can begin.</p><h4>3. 
API Call Limits</h4><p>The challenge is further compounded by rate-limiting and throttling mechanisms implemented by cloud providers. Cloud object stores impose various operational limits that require careful consideration:</p><ul><li><p>Amazon S3, for example, returns at most 1,000 objects per LIST request when using prefix listing functionality. To overcome this limitation, systems must execute hundreds of parallel calls and use randomised prefixes, with each operation typically requiring between tens and hundreds of milliseconds to complete.</p></li><li><p>S3 supports a baseline of 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket, and traffic above that rate may be throttled. <a href="https://aws.amazon.com/blogs/big-data/best-practices-to-optimize-data-access-performance-from-amazon-emr-and-aws-glue-to-amazon-s3/?">Prefix management</a> and distribution are therefore important for achieving the horizontal scalability and parallelism that large-scale data applications require when processing data on S3.</p><p></p></li></ul><h3>Mitigating Access Latencies</h3><p>The higher latency associated with cloud object storage is a crucial challenge, particularly for real-time applications. Several techniques are currently used to mitigate this:</p><ul><li><p><strong>External Metadata Management</strong>: To address the high latencies associated with listing files and directories, systems can explicitly maintain metadata records for files and partitions. 
<strong><a href="https://practicaldataengineering.substack.com/p/the-history-and-evolution-of-open">Open Table Formats</a></strong> like <strong>Apache Iceberg</strong>, <strong>Hudi</strong>, and <strong>Delta Lake</strong> store file and partition metadata in dedicated metadata and log files, minimising expensive API calls during the query planning phase by providing direct access to this information.</p></li><li><p><strong>Local Caching</strong>: Compute nodes can use fast local storage (e.g., NVMe SSDs) as a read-through and write-back cache. This reduces latency by keeping hot data local and acknowledging writes before they are flushed to the object store.</p></li><li><p><strong>Exploiting Throughput</strong>: Leveraging the high throughput of object stores by using multiple I/O threads can help improve access speed by executing many parallel I/O operations.</p></li><li><p><strong>Express Object Storage</strong>: New types of object storage, such as Amazon S3 Express One Zone, offer lower latency at a higher cost.</p></li></ul><p>These approaches aim to balance the benefits of cloud object storage with the performance needs of various applications.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9D_O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf7a125b-c91c-4fce-845d-fc3b2c022a0c_1833x1101.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9D_O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf7a125b-c91c-4fce-845d-fc3b2c022a0c_1833x1101.png 424w, 
https://substackcdn.com/image/fetch/$s_!9D_O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf7a125b-c91c-4fce-845d-fc3b2c022a0c_1833x1101.png 848w, https://substackcdn.com/image/fetch/$s_!9D_O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf7a125b-c91c-4fce-845d-fc3b2c022a0c_1833x1101.png 1272w, https://substackcdn.com/image/fetch/$s_!9D_O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf7a125b-c91c-4fce-845d-fc3b2c022a0c_1833x1101.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9D_O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf7a125b-c91c-4fce-845d-fc3b2c022a0c_1833x1101.png" width="1456" height="875" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf7a125b-c91c-4fce-845d-fc3b2c022a0c_1833x1101.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:875,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1797165,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9D_O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf7a125b-c91c-4fce-845d-fc3b2c022a0c_1833x1101.png 424w, 
https://substackcdn.com/image/fetch/$s_!9D_O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf7a125b-c91c-4fce-845d-fc3b2c022a0c_1833x1101.png 848w, https://substackcdn.com/image/fetch/$s_!9D_O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf7a125b-c91c-4fce-845d-fc3b2c022a0c_1833x1101.png 1272w, https://substackcdn.com/image/fetch/$s_!9D_O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf7a125b-c91c-4fce-845d-fc3b2c022a0c_1833x1101.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Strategies for mitigating access latencies</figcaption></figure></div><p></p><h1>Implementation Patterns</h1><p>Different use cases have led to distinct implementation patterns for zero-disk architecture, each addressing specific requirements and constraints.</p><h2>Analytical Workloads on Deep Storage</h2><p>As previously highlighted, analytical systems were among the pioneers in adopting zero-disk architecture, largely due to their ability to tolerate higher latencies.</p><p>Distributed compute frameworks like <strong>Hive</strong>, <strong>Spark</strong>, and <strong>Presto</strong>/<strong>Trino</strong> have been processing petabyte-scale analytical workloads on cloud-based data lakes for over a decade.</p><p>The emergence of <strong>Snowflake</strong> and the adoption of a similar decoupled architecture by <strong>Redshift</strong> and <strong>BigQuery</strong> further advanced the implementation of OLAP database systems on deep storage. 
These platforms demonstrated how separating storage and compute could deliver scalability and flexibility without compromising performance.</p><p>To address challenges such as latency and API call overhead, these systems rely on several optimisation techniques: external metadata management to reduce expensive API calls, efficient columnar storage formats such as Parquet that suit cloud object stores, and parallel processing that exploits the high aggregate throughput of object storage. Together, these techniques deliver high performance while retaining the cost efficiency and scalability of cloud-native architectures.</p><p>Building on the success of major vendors, the database industry has increasingly embraced this architectural shift, with numerous implementations emerging in recent years, and in 2024 in particular.</p><p>A recent example is the release of <strong>InfluxDB 3.0</strong>, announced in 2024, which represents a <a href="https://www.influxdata.com/blog/influxdb-3-0-system-architecture/">comprehensive architectural overhaul</a>. This revamped system uses <strong>Apache Arrow</strong> for in-memory buffering and <strong>Parquet</strong> as the file format for data persisted to a remote cloud storage service.</p><h2>Transactional Systems on Deep Storage</h2><p>While zero-disk architecture has shown considerable success with analytical workloads and offline systems such as data warehouses, implementing transactional systems presents unique challenges.</p><p>OLTP database systems demand strict ACID properties and sub-second latency, including the ability to perform fast in-place updates. 
These requirements create significant challenges when working with immutable storage layers that have high I/O latencies and weren't designed for small random writes and updates.</p><h3>Core Challenges</h3><p>Several fundamental limitations of cloud object stores make implementing full ACID guarantees challenging:</p><h4>Eventual Consistency</h4><p>Until AWS S3 introduced strong read-after-write consistency in December 2020, overwrite PUT and DELETE operations, as well as LIST results, were only eventually consistent. Changes were therefore not always immediately visible to subsequent reads, which violates a critical requirement for transactional systems.</p><h4>Lack of Mutual Exclusion</h4><p>The lack of mutual exclusion capabilities in object stores presents another significant hurdle. Without built-in support for coordinating concurrent writes or "<em><strong>put-if-absent</strong></em>" guarantees, systems risk data loss or corruption from simultaneous writes to the same location.</p><p>Open table formats like <strong>Apache Hudi</strong> and <strong>Delta Lake</strong> have addressed this by relying on external locking services, using systems like <strong>DynamoDB</strong> to coordinate concurrent writes to S3.</p><p>A significant development occurred in 2024 when AWS <a href="https://www.infoq.com/news/2024/08/amazon-s3-conditional-writes">introduced</a> <strong>"Conditional Writes"</strong> for S3. This feature ensures that a write only succeeds if certain conditions are met (for example, that no object already exists under the key), reducing the risk of unintentional overwrites and improving concurrency control.</p><h4>Non-Atomic File Operations</h4><p>Traditional systems like HDFS handle file renaming as fast, atomic metadata operations. In contrast, cloud object stores must physically copy and delete data to achieve the same result. 
This limitation significantly affected systems such as Hive, Spark, and HBase, which rely on atomic rename operations for managing data files.</p><div><hr></div><p>Given the constraints outlined above, is it feasible to design a general-purpose transactional database system, such as PostgreSQL, that utilises object storage as its primary storage layer? Let&#8217;s find out.</p><h3>Hybrid Architecture Solution</h3><p>While implementing fully zero-disk transactional systems remains challenging, a hybrid architecture has emerged as a practical solution.</p><p>This approach uses a fast replicated write cache combined with periodic offloading to deep storage, based on the <strong>LSM-Tree</strong> design found in <strong><a href="https://practicaldataengineering.substack.com/p/internal-storage-design-of-modern">modern key-value stores</a></strong> like <strong>Cassandra</strong> and <strong>RocksDB</strong>.</p><p>The LSM-Tree architecture, originally designed to avoid the high latency of random I/O on HDDs, proves equally valuable for cloud storage systems.</p><p>This design helps manage both the high I/O latencies and the lack of transactional primitives in cloud storage services. Key components include:</p><ul><li><p>A <strong>write-ahead log (WAL)</strong> maintained on low-latency block storage (such as EBS volumes) for atomic update guarantees.</p></li><li><p><strong>In-memory caching</strong> with periodic flushing to object storage, which avoids the high I/O cost of applying every update to object storage in real time.</p></li><li><p>Query execution that reconciles data from both the in-memory cache and the object store.</p></li></ul><p>This architecture effectively addresses the high IOPS requirements of traditional database systems, which typically make numerous small changes at the page level. 
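</p><p>The three components listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any system mentioned in this article is implemented: the <code>HybridStore</code> class, its method names, the JSON segment format and the <code>flush_threshold</code> parameter are all hypothetical, a local file stands in for the WAL volume, and a plain dictionary stands in for the object store bucket.</p>
<pre>
```python
import json
import os
import tempfile
from pathlib import Path


class HybridStore:
    """Toy key-value store: local WAL, in-memory memtable, object-store segments.

    All names are hypothetical. Keys are assumed to be strings and a dict
    stands in for the object store.
    """

    def __init__(self, wal_path, object_store, flush_threshold=3):
        self.wal_path = Path(wal_path)    # stand-in for a local or EBS volume
        self.object_store = object_store  # stand-in for an S3 bucket
        self.flush_threshold = flush_threshold
        self.memtable = {}
        # Assumes the bucket holds only segments written by this class.
        self.segment_count = len(object_store)
        self._replay_wal()

    def _replay_wal(self):
        # After a restart, rebuild the memtable from writes not yet flushed.
        if self.wal_path.exists():
            for line in self.wal_path.read_text().splitlines():
                record = json.loads(line)
                self.memtable[record["k"]] = record["v"]

    def put(self, key, value):
        # 1. Append to the WAL and fsync before acknowledging the write.
        with self.wal_path.open("a") as wal:
            wal.write(json.dumps({"k": key, "v": value}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Apply the change in memory, where small updates are cheap.
        self.memtable[key] = value
        # 3. Offload to object storage once enough changes have accumulated.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # One larger immutable object per flush instead of many small writes.
        name = f"segment-{self.segment_count:06d}.json"
        self.object_store[name] = json.dumps(self.memtable, sort_keys=True)
        self.segment_count += 1
        self.memtable = {}
        self.wal_path.write_text("")  # the flushed segment now covers these entries

    def get(self, key):
        # Reconcile both tiers: memtable first, then segments newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for name in sorted(self.object_store, reverse=True):
            segment = json.loads(self.object_store[name])
            if key in segment:
                return segment[key]
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bucket = {}
        store = HybridStore(Path(tmp) / "wal.log", bucket, flush_threshold=2)
        store.put("a", 1)
        store.put("b", 2)  # reaches the threshold and triggers a flush
        store.put("a", 3)  # newer value lives only in the WAL and memtable
        print(store.get("a"), store.get("b"), sorted(bucket))
```
</pre>
<p>The sketch leaves out everything that makes a production engine hard, such as compaction of old segments, replication of the write cache, and concurrent writers. Its only purpose is to show why small updates never hit the object store individually: they are absorbed by the WAL and the memtable and reach deep storage as larger immutable objects. 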
</p><p>Several modern systems have successfully implemented variations of this approach:</p><ul><li><p><strong>Neon</strong>'s <strong><a href="https://jack-vanlightly.com/analyses/2023/11/15/neon-serverless-postgresql-asds-chapter-3">Serverless PostgreSQL</a></strong> implementation uses this pattern in its cloud-native database.</p></li><li><p><strong>AutoMQ</strong> <a href="https://docs.automq.com/automq/architecture/overview">employs</a> EBS volumes for write caching and WAL while continuously offloading data to remote storage.</p></li><li><p><strong><a href="https://slatedb.io/">SlateDB</a></strong> embedded database implements an LSM-Tree architecture directly on cloud storage.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_OzB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b260de2-7ac4-43c9-b063-958082779087_896x419.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_OzB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b260de2-7ac4-43c9-b063-958082779087_896x419.png 424w, https://substackcdn.com/image/fetch/$s_!_OzB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b260de2-7ac4-43c9-b063-958082779087_896x419.png 848w, https://substackcdn.com/image/fetch/$s_!_OzB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b260de2-7ac4-43c9-b063-958082779087_896x419.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_OzB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b260de2-7ac4-43c9-b063-958082779087_896x419.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_OzB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b260de2-7ac4-43c9-b063-958082779087_896x419.png" width="896" height="419" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b260de2-7ac4-43c9-b063-958082779087_896x419.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:419,&quot;width&quot;:896,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:83867,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_OzB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b260de2-7ac4-43c9-b063-958082779087_896x419.png 424w, https://substackcdn.com/image/fetch/$s_!_OzB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b260de2-7ac4-43c9-b063-958082779087_896x419.png 848w, https://substackcdn.com/image/fetch/$s_!_OzB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b260de2-7ac4-43c9-b063-958082779087_896x419.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_OzB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b260de2-7ac4-43c9-b063-958082779087_896x419.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">LSM-Tree architecture on object storage</figcaption></figure></div><p></p><h3>Future Directions</h3><p>The general availability of <strong>S3 Express One Zone</strong>, announced in 2023 and offering <strong>single-digit millisecond latency</strong>, suggests potential future developments for transactional workloads on deep storage. 
This service <a href="https://jack-vanlightly.com/blog/2024/6/10/a-cost-analysis-of-replication-vs-s3-express-one-zone-in-transactional-data-systems">could potentially replace</a> the replicated write cache layer presented in the hybrid architecture, enabling fully zero-disk transactional systems.</p><p>However, its current limitations&#8212;including costs seven times higher than standard S3 and single-zone replication&#8212;mean that hybrid approaches are likely to remain prevalent in the near term.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!32mC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998d2111-fa02-42db-a383-23003b63b7af_896x421.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!32mC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998d2111-fa02-42db-a383-23003b63b7af_896x421.png 424w, https://substackcdn.com/image/fetch/$s_!32mC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998d2111-fa02-42db-a383-23003b63b7af_896x421.png 848w, https://substackcdn.com/image/fetch/$s_!32mC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998d2111-fa02-42db-a383-23003b63b7af_896x421.png 1272w, https://substackcdn.com/image/fetch/$s_!32mC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998d2111-fa02-42db-a383-23003b63b7af_896x421.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!32mC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998d2111-fa02-42db-a383-23003b63b7af_896x421.png" width="896" height="421" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/998d2111-fa02-42db-a383-23003b63b7af_896x421.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:421,&quot;width&quot;:896,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:83636,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!32mC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998d2111-fa02-42db-a383-23003b63b7af_896x421.png 424w, https://substackcdn.com/image/fetch/$s_!32mC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998d2111-fa02-42db-a383-23003b63b7af_896x421.png 848w, https://substackcdn.com/image/fetch/$s_!32mC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998d2111-fa02-42db-a383-23003b63b7af_896x421.png 1272w, https://substackcdn.com/image/fetch/$s_!32mC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998d2111-fa02-42db-a383-23003b63b7af_896x421.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">LSM-Tree architecture on S3 Express One Zone</figcaption></figure></div><p></p>
<h2>Real-Time Systems on Deep Storage</h2><p>Similar to OLTP database systems, real-time event-based systems like <strong>Apache Kafka</strong> rely on fast I/O and certain ACID guarantees.</p><p>Constructing low-latency, real-time systems on deep storage presents unique challenges, as traditional disk-based systems depend on high IOPS and low-latency sequential I/O. 
Writing directly to cloud storage can introduce latencies in the hundreds of milliseconds, which is unacceptable for many real-time scenarios.</p><p>One solution is to mirror the approach presented for transactional systems: combining intermediate fast local write caches with periodic flushing to deep storage using LSM-Tree architecture.</p><p><strong>AutoMQ</strong> implements this strategy by using EBS volumes for write-ahead logging while periodically offloading data to object stores. This architecture allows them to achieve single-digit millisecond P99 latency while still benefiting from cloud storage economics.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gXF4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F651136f2-af27-4cd3-a072-05eed00a61d9_2094x977.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gXF4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F651136f2-af27-4cd3-a072-05eed00a61d9_2094x977.png 424w, https://substackcdn.com/image/fetch/$s_!gXF4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F651136f2-af27-4cd3-a072-05eed00a61d9_2094x977.png 848w, https://substackcdn.com/image/fetch/$s_!gXF4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F651136f2-af27-4cd3-a072-05eed00a61d9_2094x977.png 1272w, https://substackcdn.com/image/fetch/$s_!gXF4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F651136f2-af27-4cd3-a072-05eed00a61d9_2094x977.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gXF4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F651136f2-af27-4cd3-a072-05eed00a61d9_2094x977.png" width="1456" height="679" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/651136f2-af27-4cd3-a072-05eed00a61d9_2094x977.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:679,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1476229,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gXF4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F651136f2-af27-4cd3-a072-05eed00a61d9_2094x977.png 424w, https://substackcdn.com/image/fetch/$s_!gXF4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F651136f2-af27-4cd3-a072-05eed00a61d9_2094x977.png 848w, https://substackcdn.com/image/fetch/$s_!gXF4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F651136f2-af27-4cd3-a072-05eed00a61d9_2094x977.png 1272w, https://substackcdn.com/image/fetch/$s_!gXF4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F651136f2-af27-4cd3-a072-05eed00a61d9_2094x977.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Zero-Disk Architecture with locally attached SSD Cache</figcaption></figure></div><p></p>
<h3>Full Zero-Disk Implementation</h3><p>You might ask: is it possible to eliminate the intermediate cache entirely for real-time event-driven systems and go fully zero-disk?</p><p>Yes, but this requires accepting the primary trade-off: <strong>increased latency.</strong></p><p>As a <strong>WarpStream</strong> co-founder <a href="https://www.warpstream.com/blog/zero-disks-is-better-for-kafka">argues</a>: </p><blockquote><p><em>If you can tolerate a little extra latency, Zero Disk Architectures (ZDAs), with everything running directly through object storage with no intermediary disks, are better. Much better.</em></p></blockquote><p>WarpStream represents a pioneering implementation of this approach. Built as a Kafka-compatible platform directly on cloud storage, it demonstrates both the possibilities and the trade-offs of a fully cloud-native zero-disk architecture, taking advantage of cloud economics to avoid the exorbitant cost of running Kafka in the cloud.</p><p>The founder <a href="https://www.warpstream.com/blog/kafka-is-dead-long-live-kafka">asked</a>:</p><blockquote><p> &#8220;<em>What would Kafka look like if it was redesigned from the ground up today to run in modern cloud environments, directly on top of object storage, with no local disks to manage, but still had to support the existing Kafka protocol?</em>&#8221;</p></blockquote><p>The system uses stateless "Agents" instead of traditional brokers, eliminating the need for JVM management and rebalancing operations. 
However, it accepts higher latencies (around 600ms for write confirmation and over one second for end-to-end operations) in exchange for significantly lower costs and operational simplicity.</p><p>For organisations willing to accept this higher latency, WarpStream offers a much simpler and far more cost-effective cloud-native alternative to Kafka.</p><p>This innovative design, combined with its <strong>Bring Your Own Cloud (BYOC)</strong> deployment model that fully decouples compute from storage, attracted <strong>Confluent</strong>, which <a href="https://www.confluent.io/blog/confluent-acquires-warpstream/">acquired</a> WarpStream in 2024 to integrate its architecture into Confluent's offerings, further expanding <a href="https://www.kai-waehner.de/blog/2024/09/12/deployment-options-for-apache-kafka-self-managed-fully-managed-serverless-and-byoc-bring-your-own-cloud/">deployment options</a>.</p><p></p><h3>Industry Evolution</h3><p>Beyond WarpStream, the real-time pub/sub industry has seen rapid evolution in this space:</p><p><strong>Pinterest</strong> led early adoption, implementing an in-house pub/sub alternative to Kafka called <strong><a href="https://medium.com/pinterest-engineering/memq-an-efficient-scalable-cloud-native-pubsub-system-4402695dd4e7">MemQ</a></strong> in 2020, which proved to be 90% cheaper than traditional Kafka deployments with three-way replication across availability zones.</p><p><strong>Redpanda</strong> made cloud storage their default storage tier in 2022, adopting a cloud-first approach. 
In 2024, they introduced <strong><a href="https://www.redpanda.com/blog/cloud-topics-streaming-data-object-storage">Cloud Topics</a></strong>, where data is written directly to cloud storage while only metadata is replicated across availability zones.</p><p>As highlighted earlier, <strong>AutoMQ</strong> <a href="https://www.automq.com/blog/introducing-automq-cloud-native-replacement-of-apache-kafka">launched</a> in 2023 as a zero-disk Kafka alternative. It aims to replace local disks with object storage as the persistence layer while maintaining comparable performance and access latencies by using a locally attached cache.</p><p><strong>Confluent</strong>'s trajectory shows the industry's direction. They announced <strong><a href="https://www.confluent.io/blog/cloud-native-data-streaming-kafka-engine/">Kora</a></strong>, a proprietary cloud-native engine, in 2023, followed by <strong><a href="https://www.confluent.io/blog/introducing-confluent-cloud-freight-clusters/">Freight Clusters</a></strong> in 2024. These innovations enable direct writing to object storage, eliminating costly cross-AZ broker replication for workloads that can tolerate higher latency and achieving up to 90% cost reduction compared to traditional deployments.</p><h2>Real-Time OLAP and Stream Processing</h2><p>The trend extends beyond messaging systems. 
<strong>Apache Doris 3.X</strong>, <a href="https://doris.apache.org/blog/release-note-3.0.0">released</a> in late 2024, introduced a compute-storage decoupled deployment mode for its OLAP engine, enabling the storage layer to operate on low-cost storage like HDFS and S3.</p><p>The <strong>StarRocks</strong> OLAP engine v3.0, launched in 2023, introduced a shared-data cluster featuring a storage-compute separation architecture, enabling data persistence in S3-compatible object storage.</p><p>In 2024, <strong>StarRocks</strong> announced preview support for S3 Express One Zone as a storage volume in version 3.3, aiming to significantly improve read and write performance on cloud storage.</p><p><strong>ClickHouse Inc</strong> launched <strong><a href="https://clickhouse.com/blog/clickhouse-cloud-public-beta">ClickHouse Cloud</a></strong> in 2022 with a separated storage and compute architecture. It is optimised for storing data on a shared cloud object store, while using local NVMe SSDs as a caching layer on both the read and write paths to mitigate the added access latency introduced by object storage.</p><p><strong>Apache Flink 2.0</strong>, first available as a preview release in late 2024, marked another milestone, introducing a <strong><a href="https://www.alibabacloud.com/blog/evolution-of-flink-2-0-state-management-storage-computing-separation-architecture_601133">cloud-native architecture</a></strong> for state management. This replaced their previous complex tiered storage approach with a simpler model using deep storage as the primary storage layer and local disks as an optional cache.</p><h3>The Future of Real-Time Systems</h3><p>The evolution of real-time systems on deep storage reveals a clear trend: </p><div class="pullquote"><p><em><strong>Organisations are increasingly willing to accept higher latencies in exchange for dramatic cost reductions and operational simplicity. 
This shift is particularly notable in high-throughput workloads like log ingestion, where sub-second latency isn't critical.</strong></em></p></div><p>The success of these implementations suggests that the future of real-time systems might involve a spectrum of solutions, from hybrid approaches that prioritise performance to full zero-disk implementations that optimise for cost and simplicity.</p><p>The choice between these approaches will likely depend on specific use cases and latency requirements rather than technical limitations.</p><h2>Other Emerging Patterns and Solutions</h2><p>In addition to the growing adoption of zero-disk architecture, other emerging architectural patterns are leveraging cloud storage without fully transitioning to a zero-disk model.</p><h3>Heterogeneous Tiered Storage</h3><p>Modern systems are implementing sophisticated tiered storage strategies to balance <strong>performance</strong> and <strong>cost</strong>.</p><p>Cheap deep storage is a compelling choice for long-term retention of cold historical data. Alternatively, it can serve as a read-only replica outside the main storage system for external query processors.</p><p>Beyond established storage engines like <strong>Apache Druid</strong> and <strong>Pinot</strong>, which utilise external deep storage to offload older data segments, a growing number of disk-based storage systems have adopted similar approaches in recent years.</p><p><strong>Redpanda</strong> introduced their <em><strong><a href="https://www.redpanda.com/blog/tiered-storage-architecture-shadow-indexing-deep-dive">Archival Storage</a></strong></em> subsystem in 2021. It automatically moves large amounts of data between the brokers and remote cloud storage, with the objective of providing <em><strong>infinite data retention with good performance at a low cost</strong></em>.</p><p><strong>Apache Kafka</strong> has also embraced this approach. 
With the release of Kafka 3.6 in 2023, the platform introduced tiered storage support with pluggable remote storage options, a feature initially proposed by <strong><a href="https://www.uber.com/en-AU/blog/kafka-tiered-storage">Uber</a></strong>. The system allows for configurable retention policies per tier and automates segment migration between storage layers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!knAs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6e34bc-d88b-494c-b855-750a1c0f9837_1600x424.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!knAs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6e34bc-d88b-494c-b855-750a1c0f9837_1600x424.png 424w, https://substackcdn.com/image/fetch/$s_!knAs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6e34bc-d88b-494c-b855-750a1c0f9837_1600x424.png 848w, https://substackcdn.com/image/fetch/$s_!knAs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6e34bc-d88b-494c-b855-750a1c0f9837_1600x424.png 1272w, https://substackcdn.com/image/fetch/$s_!knAs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6e34bc-d88b-494c-b855-750a1c0f9837_1600x424.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!knAs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6e34bc-d88b-494c-b855-750a1c0f9837_1600x424.png" width="1456" height="386" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c6e34bc-d88b-494c-b855-750a1c0f9837_1600x424.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!knAs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6e34bc-d88b-494c-b855-750a1c0f9837_1600x424.png 424w, https://substackcdn.com/image/fetch/$s_!knAs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6e34bc-d88b-494c-b855-750a1c0f9837_1600x424.png 848w, https://substackcdn.com/image/fetch/$s_!knAs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6e34bc-d88b-494c-b855-750a1c0f9837_1600x424.png 1272w, https://substackcdn.com/image/fetch/$s_!knAs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6e34bc-d88b-494c-b855-750a1c0f9837_1600x424.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://www.uber.com/en-AU/blog/kafka-tiered-storage">Uber</a></figcaption></figure></div><p><strong>Crunchy Data</strong> <a href="https://www.crunchydata.com/blog/syncing-postgres-partitions-to-your-data-lake-in-bridge-for-analytics">implemented a feature</a> in their managed PostgreSQL service that copies data into the data lake in Parquet format for cheaper long-term storage.</p><h3>Remote Read Replicas</h3><p>Redpanda's <strong>Remote Read Replica</strong> functionality, <a href="https://www.redpanda.com/blog/remote-read-replicas-for-distributing-work">introduced in 2022</a>, represents another innovative way of leveraging object storage in a semi-disk-based architecture. </p><p>This feature enables separate consumer clusters to access archived data without impacting primary production clusters. 
It also allows for the integration of topics from multiple production clusters, providing unified read-only access for downstream consumers.</p><div><hr></div><p>While tiered storage and remote read replicas let disk-based systems such as Kafka and Redpanda take advantage of cloud-storage economics without a full re-architecture, they still don't live up to the promise of reduced total cost and simplicity offered by a full zero-disk architecture.</p><h1>Conclusion</h1><p>The rise of zero-disk architecture represents a fundamental shift in storage system design, driven by cloud economics and the maturation of distributed object storage. While this approach introduces certain trade-offs, particularly around latency, the benefits in terms of cost reduction, operational simplicity, and built-in reliability make it an increasingly attractive option for modern applications.</p><p>As cloud infrastructure continues to evolve and new techniques for managing latency emerge, the balance between performance and cost will likely continue to shift in favour of zero-disk architectures, particularly for organisations prioritising cost optimisation over sub-second latency requirements.</p><p>Recent industry developments further demonstrate the momentum behind zero-disk architectures. 
The introduction of cloud-native Kafka solutions like Confluent's Kora and Freight Clusters, which as of 2024 integrate directly with object stores, along with Confluent's acquisition of WarpStream, the launch of Flink 2.0 with a cloud-native architecture, the full re-architecture of InfluxDB, and Redpanda's emphasis on cloud-first strategies and cloud topics, all signal a growing shift toward cloud-native, zero-disk architectures as a dominant paradigm for modern storage systems.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DS7n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" width="220" height="218.42293906810036" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:558,&quot;resizeWidth&quot;:220,&quot;bytes&quot;:136646,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/in/alirezasadeghi/&quot;,&quot;text&quot;:&quot;Follow Me on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/in/alirezasadeghi/"><span>Follow Me on LinkedIn</span></a></p>]]></content:encoded></item><item><title><![CDATA[The Rise of Single-Node Processing: Challenging the Distributed-First Mindset]]></title><description><![CDATA[Data Landscape Trends: 2024-2025 Series]]></description><link>https://www.pracdata.io/p/the-rise-of-single-node-processing</link><guid isPermaLink="false">https://www.pracdata.io/p/the-rise-of-single-node-processing</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Mon, 06 Jan 2025 11:08:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ibg0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225d89a3-ade0-4e13-9128-d343e8d3974f_1002x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ibg0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225d89a3-ade0-4e13-9128-d343e8d3974f_1002x708.png 424w, https://substackcdn.com/image/fetch/$s_!Ibg0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225d89a3-ade0-4e13-9128-d343e8d3974f_1002x708.png 848w, 
https://substackcdn.com/image/fetch/$s_!Ibg0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225d89a3-ade0-4e13-9128-d343e8d3974f_1002x708.png 1272w, https://substackcdn.com/image/fetch/$s_!Ibg0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225d89a3-ade0-4e13-9128-d343e8d3974f_1002x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ibg0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225d89a3-ade0-4e13-9128-d343e8d3974f_1002x708.png" width="1002" height="708" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/225d89a3-ade0-4e13-9128-d343e8d3974f_1002x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:708,&quot;width&quot;:1002,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:345461,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ibg0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225d89a3-ade0-4e13-9128-d343e8d3974f_1002x708.png 424w, https://substackcdn.com/image/fetch/$s_!Ibg0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225d89a3-ade0-4e13-9128-d343e8d3974f_1002x708.png 848w, 
https://substackcdn.com/image/fetch/$s_!Ibg0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225d89a3-ade0-4e13-9128-d343e8d3974f_1002x708.png 1272w, https://substackcdn.com/image/fetch/$s_!Ibg0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F225d89a3-ade0-4e13-9128-d343e8d3974f_1002x708.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is part two of <strong>Data Landscape Trends 2024-2025</strong> series, focusing on single-node processing 
trends.</p><h2>Introduction</h2><p>2024 witnessed growing interest in single-node processing frameworks, with tools like <strong>DuckDB</strong>, <strong>Apache DataFusion</strong>, and <strong>Polars</strong> gaining unprecedented popularity in the data community.</p><p>This trend represents more than just a technological advancement; it marks a fundamental reassessment of how we approach data analytics. </p><p>As we move away from the "big data" era's distributed-first mindset, many businesses are discovering that single-node processing solutions often provide a more efficient, cost-effective, and manageable approach to their analytical needs when their data volumes are modest.</p><p>When I recently published a <strong><a href="https://www.linkedin.com/posts/alirezasadeghi_why-single-node-engines-are-gaining-ground-activity-7252649616989450243-sbml?utm_source=share&amp;utm_medium=member_desktop">short post</a></strong> on LinkedIn titled <em><strong>"Why Single-Node Engines Are Gaining Ground in Data Processing"</strong></em>, I didn&#8217;t anticipate the significant attention it would receive from the LinkedIn data community. 
This response underscored the industry&#8217;s increasing interest in the topic.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ii7s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8b53b4-0453-438d-9a3f-10417f1ebb84_1054x1144.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ii7s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8b53b4-0453-438d-9a3f-10417f1ebb84_1054x1144.png 424w, https://substackcdn.com/image/fetch/$s_!ii7s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8b53b4-0453-438d-9a3f-10417f1ebb84_1054x1144.png 848w, https://substackcdn.com/image/fetch/$s_!ii7s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8b53b4-0453-438d-9a3f-10417f1ebb84_1054x1144.png 1272w, https://substackcdn.com/image/fetch/$s_!ii7s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8b53b4-0453-438d-9a3f-10417f1ebb84_1054x1144.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ii7s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8b53b4-0453-438d-9a3f-10417f1ebb84_1054x1144.png" width="1054" height="1144" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db8b53b4-0453-438d-9a3f-10417f1ebb84_1054x1144.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1144,&quot;width&quot;:1054,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:656758,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ii7s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8b53b4-0453-438d-9a3f-10417f1ebb84_1054x1144.png 424w, https://substackcdn.com/image/fetch/$s_!ii7s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8b53b4-0453-438d-9a3f-10417f1ebb84_1054x1144.png 848w, https://substackcdn.com/image/fetch/$s_!ii7s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8b53b4-0453-438d-9a3f-10417f1ebb84_1054x1144.png 1272w, https://substackcdn.com/image/fetch/$s_!ii7s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8b53b4-0453-438d-9a3f-10417f1ebb84_1054x1144.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In this article, I explore the subject in greater detail and share further insights.</p><h1>Rethinking Big Data</h1><p>The past decade saw businesses scrambling to implement big data strategies, with many investing heavily in distributed processing frameworks like <strong>Hadoop</strong> and <strong>Spark</strong>.</p><p>However, recent analyses reveal a surprising truth: most companies don't actually have "<strong>big data</strong>".</p><p>A significant majority of companies <a href="https://kjhealey.medium.com/cached-takes-80-of-companies-do-not-need-snowflake-or-databricks-5ebda64c0853">do not require</a> large data platforms to address their data analytics needs. 
Often, these companies are swayed by marketing hype and make substantial investments in these platforms, which may not effectively resolve their actual data challenges.</p><p>Jordan Tigani, a founding engineer on <strong>Google BigQuery</strong>, <a href="https://motherduck.com/blog/big-data-is-dead/">analysed</a> usage patterns and found that the median data storage size among heavy BigQuery users is less than 100 GB.</p><p>Even more revealing, an <a href="https://motherduck.com/blog/redshift-files-hunt-for-big-data/">analysis</a> of half a billion queries run on <strong>Amazon Redshift</strong>, published in a paper, showed that:</p><ul><li><p>Over 99% of queries processed less than 10 TB of data.</p></li><li><p>Over 90% of sessions processed less than 1 TB.</p></li></ul><p>The <a href="https://www.vldb.org/pvldb/vol17/p3694-saxena.pdf">paper</a> also states that:</p><blockquote><p>Most tables have less than a million rows and the vast majority (98 %) has less than a billion rows. Much of this data is small enough such that it can be cached or replicated.</p></blockquote><p>Taking 1 TB as the threshold for big data processing, this analysis suggests that over 90% of sessions fall below it.</p><p>As a result, single-node processing engines have the potential to handle workloads that previously required distributed systems like Spark, Trino, or Amazon Athena running across multiple machines.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fNW8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!fNW8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png 424w, https://substackcdn.com/image/fetch/$s_!fNW8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png 848w, https://substackcdn.com/image/fetch/$s_!fNW8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png 1272w, https://substackcdn.com/image/fetch/$s_!fNW8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fNW8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png" width="1201" height="812" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1201,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:337292,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!fNW8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png 424w, https://substackcdn.com/image/fetch/$s_!fNW8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png 848w, https://substackcdn.com/image/fetch/$s_!fNW8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png 1272w, https://substackcdn.com/image/fetch/$s_!fNW8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b2f85fc-45e6-413b-87f2-b1edc0a6baac_1201x812.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This reality challenges the common notion that big data infrastructure is a necessity for all modern businesses.</p><h1>Workload Patterns &amp; Rapid Data Aging</h1><p>The case for single-node processing becomes even more compelling when we examine how organisations actually use their data. </p><p>Two key patterns emerge: the data aging effect and the 90/10 rule of analytical workloads.</p><h3>The Data Aging Effect</h3><p>As data ages, access frequency declines sharply. 
For the majority of companies, data access patterns follow a predictable lifecycle:</p><ul><li><p><strong>Hot data</strong> (0-48 hours): Accessed primarily by ETL pipelines.</p></li><li><p><strong>Warm data</strong> (2-30 days): Accounts for most analytical queries.</p></li><li><p><strong>Cold data</strong> (30+ days): Rarely accessed but often retained for compliance or historical analysis.</p></li></ul><p>Studies of <a href="https://learning.oreilly.com/videos/strata-conference-santa/9781491900321/9781491900321-oreillyvideos1976679/">Meta</a>'s and <a href="https://innovation.ebayinc.com/tech/engineering/hdfs-storage-efficiency-using-tiered-storage/">eBay</a>'s data access patterns revealed this sharp decline in access after the first few days, with data typically becoming cold after a month.</p><p>In our analysis of a petabyte-scale data lake, we found that raw data remains hot for only 48 hours, with 95% of access occurring in that time, mainly by downstream ETL pipelines. In the Analytics (Gold) zone, the hot period lasts about 7 days, and 95% of queries target data less than 30 days old.</p><h3>The 90/10 Rule for Analytical Workloads</h3><p>This aging effect leads to the 90/10 rule in analytical workloads:</p><p>If the combined hot and warm period is 30 days and accounts for 90% of workloads, then, with a one-year retention period, about 90% of workloads access less than 10% of the data, since 30 of 365 days is roughly 8% (assuming data arrives at a steady rate).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fHif!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041961b-4bcf-4b19-83a2-90aa77b42139_1041x1340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!fHif!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041961b-4bcf-4b19-83a2-90aa77b42139_1041x1340.png 424w, https://substackcdn.com/image/fetch/$s_!fHif!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041961b-4bcf-4b19-83a2-90aa77b42139_1041x1340.png 848w, https://substackcdn.com/image/fetch/$s_!fHif!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041961b-4bcf-4b19-83a2-90aa77b42139_1041x1340.png 1272w, https://substackcdn.com/image/fetch/$s_!fHif!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041961b-4bcf-4b19-83a2-90aa77b42139_1041x1340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fHif!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041961b-4bcf-4b19-83a2-90aa77b42139_1041x1340.png" width="1041" height="1340" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b041961b-4bcf-4b19-83a2-90aa77b42139_1041x1340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1340,&quot;width&quot;:1041,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:386720,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!fHif!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041961b-4bcf-4b19-83a2-90aa77b42139_1041x1340.png 424w, https://substackcdn.com/image/fetch/$s_!fHif!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041961b-4bcf-4b19-83a2-90aa77b42139_1041x1340.png 848w, https://substackcdn.com/image/fetch/$s_!fHif!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041961b-4bcf-4b19-83a2-90aa77b42139_1041x1340.png 1272w, https://substackcdn.com/image/fetch/$s_!fHif!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041961b-4bcf-4b19-83a2-90aa77b42139_1041x1340.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This pattern is remarkably consistent across industries and use cases. Even in organisations with large datasets, most analytical workloads operate on recent, aggregated data that could easily fit within single-node processing capabilities.</p><h1>Hardware Evolution &amp; Rethinking Scale Up</h1><p>The capability of single-node systems has grown exponentially since the early days of big data.</p><p>The scale-out strategy, which became popular in data processing with the emergence of Hadoop in the mid-2000s, was motivated by the need to combine multiple machines to overcome scaling challenges, so that large datasets could be processed efficiently within reasonable timeframes and performance levels.</p><p>By integrating multiple machines in distributed systems, we effectively create a single large unit, pooling resources such as RAM, CPU, disk space, and bandwidth into one large virtual machine.</p><p>However, we need to reassess our assumptions about distributed processing and the scaling challenges faced in the 2000s to see if they remain valid today.</p><p>In 2006, when Hadoop MapReduce emerged, the first AWS EC2 instances (m1.small) had just 1 CPU and less than 2 GB RAM. 
Today's cloud providers offer instances with 64+ cores and 256GB+ of RAM, fundamentally changing the equation for what's possible with single-node processing.</p><p>Tracking balanced EC2 instances (one CPU core for every 4 GB of memory) over the years reveals exponential growth in the compute and memory available on a single instance.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eSJI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3176164-f6f6-4a59-b5e4-1c41c4e6094f_2172x668.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eSJI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3176164-f6f6-4a59-b5e4-1c41c4e6094f_2172x668.png 424w, https://substackcdn.com/image/fetch/$s_!eSJI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3176164-f6f6-4a59-b5e4-1c41c4e6094f_2172x668.png 848w, https://substackcdn.com/image/fetch/$s_!eSJI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3176164-f6f6-4a59-b5e4-1c41c4e6094f_2172x668.png 1272w, https://substackcdn.com/image/fetch/$s_!eSJI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3176164-f6f6-4a59-b5e4-1c41c4e6094f_2172x668.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eSJI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3176164-f6f6-4a59-b5e4-1c41c4e6094f_2172x668.png" width="724" height="222.76923076923077" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3176164-f6f6-4a59-b5e4-1c41c4e6094f_2172x668.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:448,&quot;width&quot;:1456,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:217852,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eSJI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3176164-f6f6-4a59-b5e4-1c41c4e6094f_2172x668.png 424w, https://substackcdn.com/image/fetch/$s_!eSJI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3176164-f6f6-4a59-b5e4-1c41c4e6094f_2172x668.png 848w, https://substackcdn.com/image/fetch/$s_!eSJI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3176164-f6f6-4a59-b5e4-1c41c4e6094f_2172x668.png 1272w, https://substackcdn.com/image/fetch/$s_!eSJI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3176164-f6f6-4a59-b5e4-1c41c4e6094f_2172x668.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><h3>The Economics of Scale-Up vs. Scale-Out</h3><p>One might assume that scaling out across multiple smaller instances is more cost-effective than using larger instances. 
However, cloud pricing models tell a different story.</p><p>The cost per compute unit in the cloud is the same whether you use a smaller or a larger instance, because prices within an instance family scale linearly with size.</p><p>In other words, the total price is the same whether you run one larger instance or several smaller ones, as long as the total number of cores and the total amount of memory are the same.</p><p>Using AWS's <strong>m5</strong> instance family as an example, whether you scale up with a single <strong>m5.16xlarge</strong> instance (64 vCPUs, 256 GiB of memory) or scale out with eight <strong>m5.2xlarge</strong> instances (8 vCPUs and 32 GiB each), the price per hour is the same.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sOa8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9fcb1d-be8b-47db-9514-41e48a58308e_882x694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sOa8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9fcb1d-be8b-47db-9514-41e48a58308e_882x694.png 424w, https://substackcdn.com/image/fetch/$s_!sOa8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9fcb1d-be8b-47db-9514-41e48a58308e_882x694.png 848w, https://substackcdn.com/image/fetch/$s_!sOa8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9fcb1d-be8b-47db-9514-41e48a58308e_882x694.png 1272w, 
https://substackcdn.com/image/fetch/$s_!sOa8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9fcb1d-be8b-47db-9514-41e48a58308e_882x694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sOa8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9fcb1d-be8b-47db-9514-41e48a58308e_882x694.png" width="882" height="694" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d9fcb1d-be8b-47db-9514-41e48a58308e_882x694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:882,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100778,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sOa8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9fcb1d-be8b-47db-9514-41e48a58308e_882x694.png 424w, https://substackcdn.com/image/fetch/$s_!sOa8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9fcb1d-be8b-47db-9514-41e48a58308e_882x694.png 848w, https://substackcdn.com/image/fetch/$s_!sOa8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9fcb1d-be8b-47db-9514-41e48a58308e_882x694.png 1272w, 
https://substackcdn.com/image/fetch/$s_!sOa8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9fcb1d-be8b-47db-9514-41e48a58308e_882x694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This hardware evolution has important implications for system architecture decisions as:</p><div class="pullquote"><p> Modern instances can handle workloads that previously required dozens of smaller nodes, and they do so with reduced complexity and overhead.</p></div><p><strong>This raises a critical question:</strong></p><p>From a cost-performance perspective, 
if a single-node query engine can handle the majority of workloads efficiently, is there still a benefit to distributing processing across multiple nodes?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XlD1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6db7c4-5fe4-4a6c-81b1-da7d15443e66_1156x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XlD1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6db7c4-5fe4-4a6c-81b1-da7d15443e66_1156x742.png 424w, https://substackcdn.com/image/fetch/$s_!XlD1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6db7c4-5fe4-4a6c-81b1-da7d15443e66_1156x742.png 848w, https://substackcdn.com/image/fetch/$s_!XlD1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6db7c4-5fe4-4a6c-81b1-da7d15443e66_1156x742.png 1272w, https://substackcdn.com/image/fetch/$s_!XlD1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6db7c4-5fe4-4a6c-81b1-da7d15443e66_1156x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XlD1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6db7c4-5fe4-4a6c-81b1-da7d15443e66_1156x742.png" width="1156" height="742" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f6db7c4-5fe4-4a6c-81b1-da7d15443e66_1156x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1156,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:500585,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XlD1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6db7c4-5fe4-4a6c-81b1-da7d15443e66_1156x742.png 424w, https://substackcdn.com/image/fetch/$s_!XlD1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6db7c4-5fe4-4a6c-81b1-da7d15443e66_1156x742.png 848w, https://substackcdn.com/image/fetch/$s_!XlD1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6db7c4-5fe4-4a6c-81b1-da7d15443e66_1156x742.png 1272w, https://substackcdn.com/image/fetch/$s_!XlD1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6db7c4-5fe4-4a6c-81b1-da7d15443e66_1156x742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>
<p></p><h1>The Performance Case for Single-Node Processing</h1><p>Modern single-node processing engines leverage advanced techniques to deliver impressive performance.</p><p>Engines like <strong>DuckDB</strong> and <strong>Apache DataFusion</strong> achieve superior performance through sophisticated optimization techniques, including vectorized execution, parallel processing, and efficient memory management.</p><p>Numerous benchmarks illustrate these performance improvements:</p><ul><li><p>Vantage <a href="https://www.vantage.sh/blog/clickhouse-local-vs-duckdb">reported</a> that when they switched from Postgres to DuckDB for cloud cost analysis, they saw performance improvements between 4X and 200X.</p></li><li><p><a href="https://www.fivetran.com/blog/how-fast-is-duckdb-really">Benchmarks</a> by Fivetran's CEO using TPC-DS datasets showed DuckDB outperforming commercial data warehouses for datasets under 300 GB.</p></li><li><p>An <a href="https://fet.dev/posts/throwing-lots-of-data-on-duckdb">experiment</a> on 1 billion rows of fake order data compared DuckDB with Amazon Athena.</p></li></ul><p></p><h1>Why Choose Single-Node Processing?</h1><p>The case for single-node processing extends beyond just performance. For the majority of businesses, modern single-node engines offer several compelling advantages:</p><ul><li><p>They dramatically simplify system architecture by eliminating the complexity of distributed systems. 
This simplification reduces operational overhead, makes debugging easier, and lowers the barrier to entry for teams working with data.</p></li><li><p>They often provide better resource utilisation. Without the overhead of network communication and distributed coordination, more computing power can be dedicated to actual data processing. This efficiency translates directly to cost savings and improved performance.</p></li><li><p>They offer excellent integration with modern data workflows. Engines like <strong>chDB</strong> and <strong>DuckDB</strong> can directly query data from cloud storage, work seamlessly with popular programming languages, and fit naturally into existing data pipelines.</p></li><li><p>The embeddable nature of some of these engines enables seamless integration with existing systems&#8212;from PostgreSQL extensions like <strong><a href="https://github.com/paradedb/pg_analytics">pg_analytics</a></strong> and <strong><a href="https://github.com/duckdb/pg_duckdb">pg_duckdb</a></strong> to various modern <strong><a href="https://practicaldataengineering.substack.com/p/the-evolution-of-business-intelligence-stack">Business Intelligence tools</a></strong>&#8212;expanding analytical capabilities without disrupting established workflows.</p><p></p></li></ul><h2>Challenges and Limitations</h2><p>While single-node processing offers many advantages, it's important to acknowledge its limitations. </p><p>Some engines still face challenges in fully utilising all available CPU cores on large machines, particularly as core counts continue to increase. Memory hierarchy bandwidth between RAM and CPU can become a bottleneck for certain workloads.</p><p>When reading from cloud storage like S3, single-connection transfer speeds may be limited, though this can often be mitigated through parallel connections and intelligent caching strategies. 
And naturally, there remain workloads involving very large datasets that exceed available memory and storage, requiring distributed processing.</p><h1>Conclusion</h1><p>The rise of single-node processing engines represents a pragmatic shift in data analytics. As hardware capabilities continue to advance and single-node engines become more sophisticated, the need for distributed processing will likely continue to decrease for most organisations. </p><p>For the vast majority of companies, single-node processing frameworks offer a more efficient, cost-effective, and manageable solution to their data analytics needs. As we move forward, the key is not to automatically reach for distributed solutions, but to carefully evaluate actual workload requirements and choose the right tool for the job. </p><p>The future of data processing may well be less about managing clusters and more about leveraging the impressive capabilities of modern single-node systems.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DS7n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, 
https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" width="220" height="218.42293906810036" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:558,&quot;resizeWidth&quot;:220,&quot;bytes&quot;:136646,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, 
https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/in/alirezasadeghi/&quot;,&quot;text&quot;:&quot;Follow Me on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/in/alirezasadeghi/"><span>Follow Me on LinkedIn</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[The Evolution of Business Intelligence: From Monolithic to Composable Architecture]]></title><description><![CDATA[The Business Intelligence (BI) landscape has undergone significant transformation in recent years, particularly in how data is presented and processed.]]></description><link>https://www.pracdata.io/p/the-evolution-of-business-intelligence-stack</link><guid isPermaLink="false">https://www.pracdata.io/p/the-evolution-of-business-intelligence-stack</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Wed, 18 Dec 2024 13:06:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7d3e7046-6c4c-49f3-b45b-a339b6945f4b_403x364.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!33iJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2404a933-712e-4779-a896-2b1854714156_1336x944.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!33iJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2404a933-712e-4779-a896-2b1854714156_1336x944.png 424w, https://substackcdn.com/image/fetch/$s_!33iJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2404a933-712e-4779-a896-2b1854714156_1336x944.png 848w, https://substackcdn.com/image/fetch/$s_!33iJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2404a933-712e-4779-a896-2b1854714156_1336x944.png 1272w, https://substackcdn.com/image/fetch/$s_!33iJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2404a933-712e-4779-a896-2b1854714156_1336x944.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!33iJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2404a933-712e-4779-a896-2b1854714156_1336x944.png" width="1336" height="944" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2404a933-712e-4779-a896-2b1854714156_1336x944.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:944,&quot;width&quot;:1336,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:504652,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!33iJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2404a933-712e-4779-a896-2b1854714156_1336x944.png 424w, https://substackcdn.com/image/fetch/$s_!33iJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2404a933-712e-4779-a896-2b1854714156_1336x944.png 848w, https://substackcdn.com/image/fetch/$s_!33iJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2404a933-712e-4779-a896-2b1854714156_1336x944.png 1272w, https://substackcdn.com/image/fetch/$s_!33iJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2404a933-712e-4779-a896-2b1854714156_1336x944.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As we dive into 2025, the data engineering field continues its dramatic evolution. In this series, we'll explore the transformative trends reshaping the data engineering landscape, from emerging architectural patterns to new tooling approaches. 
</p><p>This is part one of our series, focusing on the evolution of Business Intelligence architecture.</p><h1>Introduction</h1><p>The <strong>Business Intelligence (BI)</strong> landscape has undergone significant transformation in recent years, particularly in how data is presented and processed.</p><p>This evolution reflects a broader shift from monolithic architectures to more <strong>flexible</strong>, <strong>composable</strong> solutions that better serve modern analytics needs.</p><p>This article traces the evolution of BI architecture through several key phases: from traditional monolithic systems, through the emergence of headless and bottomless BI, to the latest developments in BI-as-Code and embedded analytics.</p><p></p><h1>Traditional BI Architecture: The Monolithic Approach</h1><p>Traditional BI tools were built as comprehensive, tightly coupled systems with a significant focus on user interface design.</p><p>These systems provided extensive flexibility through click-through functionality for slicing, dicing, and grouping data using various visualisations. 
At their core, these systems were composed of three interconnected components that worked in harmony to deliver business insights.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wgDg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647ca6e8-e2d4-428c-b8c2-bc0ae23c437e_473x487.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wgDg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647ca6e8-e2d4-428c-b8c2-bc0ae23c437e_473x487.png 424w, https://substackcdn.com/image/fetch/$s_!wgDg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647ca6e8-e2d4-428c-b8c2-bc0ae23c437e_473x487.png 848w, https://substackcdn.com/image/fetch/$s_!wgDg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647ca6e8-e2d4-428c-b8c2-bc0ae23c437e_473x487.png 1272w, https://substackcdn.com/image/fetch/$s_!wgDg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647ca6e8-e2d4-428c-b8c2-bc0ae23c437e_473x487.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wgDg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647ca6e8-e2d4-428c-b8c2-bc0ae23c437e_473x487.png" width="473" height="487" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/647ca6e8-e2d4-428c-b8c2-bc0ae23c437e_473x487.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:487,&quot;width&quot;:473,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:249446,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wgDg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647ca6e8-e2d4-428c-b8c2-bc0ae23c437e_473x487.png 424w, https://substackcdn.com/image/fetch/$s_!wgDg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647ca6e8-e2d4-428c-b8c2-bc0ae23c437e_473x487.png 848w, https://substackcdn.com/image/fetch/$s_!wgDg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647ca6e8-e2d4-428c-b8c2-bc0ae23c437e_473x487.png 1272w, https://substackcdn.com/image/fetch/$s_!wgDg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F647ca6e8-e2d4-428c-b8c2-bc0ae23c437e_473x487.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Traditional BI Stack</figcaption></figure></div><p></p><p>The backend layer served as the foundation, handling data ingestion from OLAP sources and building optimised data cubes on the server. These cubes contained pre-computed dimensions that enabled real-time data exploration.</p><p>Working in concert with the backend, the frontend layer provided the visualisation interface, connecting to the backend to access data cubes and construct dashboards. 
</p><p>The semantic layer completed the architecture by defining key performance indicators (KPIs) and metrics embedded within the BI software.</p><p></p><h2>The Drawbacks of Traditional BI Tools</h2><p>While powerful, these traditional systems came with significant overhead.</p><p>Organisations needed substantial infrastructure for on-premises deployment before managed cloud BI services became widely accessible, and the licensing costs were often prohibitive.</p><p>Implementation timelines were long, with even proof-of-concept demonstrations requiring weeks of setup and configuration. For businesses serving large user bases, the resource requirements were particularly demanding.</p><p>These fundamental limitations, combined with the growing need for flexibility and cost-effectiveness, sparked a series of architectural innovations in the BI landscape.</p><p></p><h1>The Rise of Bottomless BI Tools</h1><p>In response to these challenges, a new generation of lightweight, disaggregated BI tools emerged. Notable open-source solutions like <strong>Apache Superset</strong>, <strong>Metabase</strong>, and <strong>Redash</strong> began appearing about a decade ago, with Superset, originally developed at Airbnb, gaining particular prominence in the ecosystem.</p><p>These new tools adopted a "<em><strong>bottomless</strong></em>" architecture, eliminating the heavy backend server traditionally used for computation and for building and caching cube objects.</p><p>Instead of maintaining their own computation layer, they rely on connected source engines to run queries and supply data to dashboards at runtime. 
This architectural shift introduces different strategies for data serving.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fyxt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5dc814-0430-4e4b-bd4b-9b8fba42186c_1569x1236.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fyxt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5dc814-0430-4e4b-bd4b-9b8fba42186c_1569x1236.png 424w, https://substackcdn.com/image/fetch/$s_!fyxt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5dc814-0430-4e4b-bd4b-9b8fba42186c_1569x1236.png 848w, https://substackcdn.com/image/fetch/$s_!fyxt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5dc814-0430-4e4b-bd4b-9b8fba42186c_1569x1236.png 1272w, https://substackcdn.com/image/fetch/$s_!fyxt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5dc814-0430-4e4b-bd4b-9b8fba42186c_1569x1236.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fyxt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5dc814-0430-4e4b-bd4b-9b8fba42186c_1569x1236.png" width="1456" height="1147" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b5dc814-0430-4e4b-bd4b-9b8fba42186c_1569x1236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1147,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:967942,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fyxt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5dc814-0430-4e4b-bd4b-9b8fba42186c_1569x1236.png 424w, https://substackcdn.com/image/fetch/$s_!fyxt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5dc814-0430-4e4b-bd4b-9b8fba42186c_1569x1236.png 848w, https://substackcdn.com/image/fetch/$s_!fyxt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5dc814-0430-4e4b-bd4b-9b8fba42186c_1569x1236.png 1272w, https://substackcdn.com/image/fetch/$s_!fyxt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5dc814-0430-4e4b-bd4b-9b8fba42186c_1569x1236.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Dealing With Query Latency</h2><p>The elimination of the backend report server presents bottomless BI tools with a significant challenge: managing query latency when accessing data in real-time.</p><p>To address this challenge, these tools employ several optimisation strategies. One key approach involves utilising pre-computed aggregates stored in the primary data warehouse, allowing dashboards to serve results efficiently.</p><p>Additionally, tools like Superset implement caching layers using Redis to store frequently accessed datasets. 
This caching mechanism proves particularly effective: once an initial query loads a dataset, subsequent visualisations and dashboard reloads can access the cached version until the underlying data changes, significantly reducing response times.</p><p>For companies handling larger data volumes, integration with specialised real-time OLAP engines like <strong>Druid</strong> and <strong>ClickHouse</strong> provides low-latency analytics capabilities.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C0z-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2666a09-2cf8-4cba-9045-111ed97e04ce_1339x1173.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C0z-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2666a09-2cf8-4cba-9045-111ed97e04ce_1339x1173.png 424w, https://substackcdn.com/image/fetch/$s_!C0z-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2666a09-2cf8-4cba-9045-111ed97e04ce_1339x1173.png 848w, https://substackcdn.com/image/fetch/$s_!C0z-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2666a09-2cf8-4cba-9045-111ed97e04ce_1339x1173.png 1272w, https://substackcdn.com/image/fetch/$s_!C0z-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2666a09-2cf8-4cba-9045-111ed97e04ce_1339x1173.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!C0z-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2666a09-2cf8-4cba-9045-111ed97e04ce_1339x1173.png" width="1339" height="1173" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2666a09-2cf8-4cba-9045-111ed97e04ce_1339x1173.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1173,&quot;width&quot;:1339,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:624464,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C0z-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2666a09-2cf8-4cba-9045-111ed97e04ce_1339x1173.png 424w, https://substackcdn.com/image/fetch/$s_!C0z-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2666a09-2cf8-4cba-9045-111ed97e04ce_1339x1173.png 848w, https://substackcdn.com/image/fetch/$s_!C0z-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2666a09-2cf8-4cba-9045-111ed97e04ce_1339x1173.png 1272w, https://substackcdn.com/image/fetch/$s_!C0z-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2666a09-2cf8-4cba-9045-111ed97e04ce_1339x1173.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h1>The Emergence of the Universal Semantic Layer</h1><p>As the industry sought more flexibility in its BI stack, the portable semantic layer, also known as <em><strong>headless BI</strong></em>, emerged as an intermediate step between traditional monolithic systems and fully lightweight solutions.</p><p>Headless BI platforms provide a dedicated semantic layer, and some also bundle a query engine, while allowing organisations to use any frontend visualisation tool of their choice. 
This approach fully disaggregates the presentation layer (front-end) from the semantic layer.</p><p>With tools like <strong>Cube</strong> and <strong>MetricFlow</strong> (now part of dbt Labs), for example, organisations can define their metrics and data models in a central location, then connect various visualisation tools, custom applications, or lightweight BI solutions to this semantic layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I7bW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164e7b01-2880-44dc-9fc0-b3d05b3c7a17_542x755.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I7bW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164e7b01-2880-44dc-9fc0-b3d05b3c7a17_542x755.png 424w, https://substackcdn.com/image/fetch/$s_!I7bW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164e7b01-2880-44dc-9fc0-b3d05b3c7a17_542x755.png 848w, https://substackcdn.com/image/fetch/$s_!I7bW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164e7b01-2880-44dc-9fc0-b3d05b3c7a17_542x755.png 1272w, https://substackcdn.com/image/fetch/$s_!I7bW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164e7b01-2880-44dc-9fc0-b3d05b3c7a17_542x755.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I7bW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164e7b01-2880-44dc-9fc0-b3d05b3c7a17_542x755.png" width="542" 
height="755" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/164e7b01-2880-44dc-9fc0-b3d05b3c7a17_542x755.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:755,&quot;width&quot;:542,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:367088,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I7bW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164e7b01-2880-44dc-9fc0-b3d05b3c7a17_542x755.png 424w, https://substackcdn.com/image/fetch/$s_!I7bW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164e7b01-2880-44dc-9fc0-b3d05b3c7a17_542x755.png 848w, https://substackcdn.com/image/fetch/$s_!I7bW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164e7b01-2880-44dc-9fc0-b3d05b3c7a17_542x755.png 1272w, https://substackcdn.com/image/fetch/$s_!I7bW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164e7b01-2880-44dc-9fc0-b3d05b3c7a17_542x755.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This architectural pattern offers several advantages over traditional BI systems. It enables organisations to maintain consistent metric definitions across different visualisation tools, supports multiple frontend applications simultaneously, and provides better integration capabilities with modern data stacks.</p><p>The semantic layer acts as a universal translator between data sources and visualisation layers, ensuring consistent business logic across all analytics applications.</p><p></p><h1>The BI-as-Code Movement</h1><p>Recent years have witnessed the emergence of <strong>BI-as-Code</strong>, representing an even lighter approach to dashboard and interactive data app development.</p><p>This paradigm shift brings software engineering workflows to BI development, enabling version control, testing, and continuous integration practices. 
Since code serves as the primary abstraction rather than a user interface, developers can implement proper development workflows before deploying to production.</p><p>Prominent tools in this space, such as <strong>Streamlit</strong>, integrate seamlessly with the Python ecosystem, allowing developers to remain within their Python projects without requiring external software installation for building dashboards and data applications.</p><p>The approach emphasises simplicity and speed, using SQL and declarative formats like <strong>YAML</strong> for dashboard creation. The resulting web apps can be easily self-hosted, providing deployment flexibility.</p><p>While <strong>Streamlit</strong> leads the pack in popularity, open-source newcomers like <strong>Evidence</strong>, <strong>Rill</strong>, <strong>Vizro</strong>, and <strong>Quary</strong> have emerged in recent years, each bringing its own approach to the BI-as-Code concept.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DcoL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14fec8b-4fa1-43a3-8aee-e8c4f17905f9_688x745.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DcoL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14fec8b-4fa1-43a3-8aee-e8c4f17905f9_688x745.png 424w, https://substackcdn.com/image/fetch/$s_!DcoL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14fec8b-4fa1-43a3-8aee-e8c4f17905f9_688x745.png 848w, 
https://substackcdn.com/image/fetch/$s_!DcoL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14fec8b-4fa1-43a3-8aee-e8c4f17905f9_688x745.png 1272w, https://substackcdn.com/image/fetch/$s_!DcoL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14fec8b-4fa1-43a3-8aee-e8c4f17905f9_688x745.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DcoL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14fec8b-4fa1-43a3-8aee-e8c4f17905f9_688x745.png" width="688" height="745" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b14fec8b-4fa1-43a3-8aee-e8c4f17905f9_688x745.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:745,&quot;width&quot;:688,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:333328,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DcoL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14fec8b-4fa1-43a3-8aee-e8c4f17905f9_688x745.png 424w, https://substackcdn.com/image/fetch/$s_!DcoL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14fec8b-4fa1-43a3-8aee-e8c4f17905f9_688x745.png 848w, 
https://substackcdn.com/image/fetch/$s_!DcoL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14fec8b-4fa1-43a3-8aee-e8c4f17905f9_688x745.png 1272w, https://substackcdn.com/image/fetch/$s_!DcoL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14fec8b-4fa1-43a3-8aee-e8c4f17905f9_688x745.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h2>BI-as-Code Limitations</h2><p>BI-as-Code tools currently have limitations in terms of interactive data exploration features and 
providing enterprise-grade BI capabilities.</p><p>They don't provide the same user experience for slicing and dicing data as traditional BI tools, and they lack the data governance and semantic layer support found in both traditional and lightweight BI solutions.</p><p>Nevertheless, BI-as-Code tools are increasingly being adopted in various ways: data science teams create interactive standalone apps, product teams build embedded analytics features, and analysts develop internal data applications.</p><p></p><h1>The New Emerging Trend: BI + Embedded Analytics</h1><p>The latest evolution in BI architecture involves integrating high-performance, embeddable OLAP query engines like <strong>Apache DataFusion</strong> and <strong>DuckDB</strong>.</p><p>This approach bridges several gaps in the current landscape while maintaining the benefits of lightweight, disaggregated architectures.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ylfR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe69b73-ed14-4492-b18f-82a58dd5b80d_1209x1094.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ylfR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe69b73-ed14-4492-b18f-82a58dd5b80d_1209x1094.png 424w, https://substackcdn.com/image/fetch/$s_!ylfR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe69b73-ed14-4492-b18f-82a58dd5b80d_1209x1094.png 848w, 
https://substackcdn.com/image/fetch/$s_!ylfR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe69b73-ed14-4492-b18f-82a58dd5b80d_1209x1094.png 1272w, https://substackcdn.com/image/fetch/$s_!ylfR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe69b73-ed14-4492-b18f-82a58dd5b80d_1209x1094.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ylfR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe69b73-ed14-4492-b18f-82a58dd5b80d_1209x1094.png" width="1209" height="1094" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cbe69b73-ed14-4492-b18f-82a58dd5b80d_1209x1094.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1094,&quot;width&quot;:1209,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:755619,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ylfR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe69b73-ed14-4492-b18f-82a58dd5b80d_1209x1094.png 424w, https://substackcdn.com/image/fetch/$s_!ylfR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe69b73-ed14-4492-b18f-82a58dd5b80d_1209x1094.png 848w, 
https://substackcdn.com/image/fetch/$s_!ylfR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe69b73-ed14-4492-b18f-82a58dd5b80d_1209x1094.png 1272w, https://substackcdn.com/image/fetch/$s_!ylfR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe69b73-ed14-4492-b18f-82a58dd5b80d_1209x1094.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The new full-stack composable BI architecture brings several key advantages:</p><ul><li><p>First, it offers true composability and interoperability, with the ability to swap embedded compute engines as needed while maintaining a standalone semantic layer for metric definitions.</p></li><li><p>The embedded analytics capabilities are particularly powerful: zero-copy integration through standard frameworks, mainly <strong>Apache Arrow</strong>, enables microsecond-level data access through optimised in-memory columnar formats.</p></li></ul><div class="pullquote"><p><em><strong>Zero-copy</strong> integration refers to a performance optimisation technique where data can be accessed and processed without serialising or converting it between different in-memory representations. 
In the context of DataFusion and Apache Arrow, this means that when data is loaded into memory in Arrow's columnar format, DataFusion can perform computations directly on this data without converting or copying it into its own internal format.</em></p></div><ul><li><p>The direct support for <strong>data lakes</strong> and <strong>lakehouses</strong> represents another significant advance, allowing teams to build dashboards directly on top of open table formats like <strong>Apache Iceberg</strong> and <strong>Apache Hudi</strong> without intermediate data movement.</p></li><li><p>This capability, combined with comprehensive federated query support, resolves a long-standing challenge in existing lightweight BI tools, which struggled to combine data from multiple sources without an external federated query engine.</p></li></ul><h2>Industry Adoption</h2><p>Adoption of embedded query engines is gaining significant momentum across the BI ecosystem. Commercial vendors are leading this transformation: <strong>Omni</strong> has <a href="https://omni.co/blog/DuckDB-complements-BI">integrated</a> DuckDB as its core analytics engine, while <strong>Cube.dev</strong> has implemented a sophisticated combination of Apache Arrow and DataFusion in its headless BI architecture.</p><p>Similarly, <strong><a href="https://www.gooddata.com/blog/analytics-stack-with-apache-arrow/">GoodData</a></strong> has embraced this trend by implementing Apache Arrow as the foundation of its FlexQuery system's caching layer, and <strong>Preset</strong> (managed Superset) has <a href="https://preset.io/blog/preset-and-motherduck-the-easiest-way-to-connect-apache-superset-to-duckdb/">integrated</a> with <strong>MotherDuck</strong> (managed DuckDB).</p><p>In the open-source space, both <strong>Superset</strong> (using the <a href="https://pypi.org/project/duckdb-engine/">duckdb-engine</a> library) and <strong>Metabase</strong> now support an embedded DuckDB 
connection, with potential future integration into their core engines.</p><p>The BI-as-Code movement has also embraced embedded engines. <strong>Rilldata</strong> <a href="https://www.rilldata.com/blog/why-we-built-rill-with-duckdb">announced</a> DuckDB integration in 2023 for auto-profiling and interactive modeling in dashboard development, while <strong>Evidence</strong> introduced <a href="https://evidence.dev/blog/why-we-built-usql">Universal SQL</a> in 2024, powered by DuckDB's WebAssembly implementation.</p><h1>Conclusion</h1><p>The Business Intelligence landscape continues its evolution toward more flexible, efficient solutions.</p><p>Each architectural evolution has brought distinct advantages: headless BI enabled consistent metrics across tools, bottomless BI reduced infrastructure complexity, BI-as-Code brought developer workflows to analytics, and embedded engines are now combining these benefits with high-performance query capabilities.</p><p>The integration of embedded query engines with lightweight BI tools represents a promising direction for lightweight BI, combining the best aspects of traditional BI capabilities with modern architectural patterns. 
As these technologies mature and the ecosystem grows, companies can look forward to increasingly sophisticated yet composable solutions for their data analysis needs.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DS7n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" width="220" height="218.42293906810036" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:558,&quot;resizeWidth&quot;:220,&quot;bytes&quot;:136646,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/in/alirezasadeghi/&quot;,&quot;text&quot;:&quot;Follow Me on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/in/alirezasadeghi/"><span>Follow Me on 
LinkedIn</span></a></p>]]></content:encoded></item><item><title><![CDATA[DLD #4 | Data Landscape Digest 🗞️]]></title><description><![CDATA[Rise of single-node engines, Postgres + DuckDB, Airflow alerting techniques, BigQuery continuous queries, top data engineering books and more]]></description><link>https://www.pracdata.io/p/dld-4-data-landscape-digest</link><guid isPermaLink="false">https://www.pracdata.io/p/dld-4-data-landscape-digest</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 03 Nov 2024 07:55:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elGK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!elGK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!elGK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!elGK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!elGK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!elGK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!elGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg" width="1456" height="1020" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!elGK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!elGK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!elGK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!elGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><div><hr></div><h1>&#10024; Featured - The Rise of Single-Node Engines</h1><div><hr></div><p>In a recent <strong><a href="https://motherduck.com/blog/redshift-files-hunt-for-big-data/">blog post</a></strong>, <strong>Jordan Tigani</strong>, MotherDuck co-founder and  former tech lead at Google BigQuery, highlighted that most companies don't actually 
deal with "<strong>Big Data</strong>." An analysis of half a billion sample queries run on Amazon Redshift revealed that over 80% of the queries processed less than 1 TB of data.</p><p>Even among the small percentage of businesses that do handle Big Data, the majority of queries (95%) are executed on smaller tables. In cases where companies have large datasets, with some tables exceeding 10 TB, about 96% of the queries still target smaller, likely aggregated tables of 100 GB or less, rather than the actual large tables.</p><p>A <strong><a href="https://www.amazon.science/publications/why-tpc-is-not-enough-an-analysis-of-the-amazon-redshift-fleet">published paper</a></strong> by AWS also notes that most tables contain fewer than a million rows, with the vast majority (98%) having fewer than a billion rows. </p><p>These findings, combined with advancements in software and hardware technology that have significantly enhanced the processing capabilities of single-node systems, suggest that powerful emerging single-node compute engines like <strong>DuckDB</strong> could increasingly handle many non-big data use-cases. This would reduce the need for distributed processing frameworks like <strong>Spark</strong>, provided that the maturity and ecosystem support of these engines continue to evolve.</p><p>In a recent <strong><a href="https://www.linkedin.com/posts/alirezasadeghi_why-single-node-engines-are-gaining-ground-activity-7252649616989450243-sbml?utm_source=share">LinkedIn post</a></strong>, I shared a graph with some of these points, which sparked an engaging discussion among the data engineering community. 
The conversation featured a mix of opinions on the future of computing, and it's worth checking out.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mxt_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mxt_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 424w, https://substackcdn.com/image/fetch/$s_!mxt_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 848w, https://substackcdn.com/image/fetch/$s_!mxt_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 1272w, https://substackcdn.com/image/fetch/$s_!mxt_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mxt_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png" width="1201" height="812" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1201,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:325504,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mxt_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 424w, https://substackcdn.com/image/fetch/$s_!mxt_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 848w, https://substackcdn.com/image/fetch/$s_!mxt_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 1272w, https://substackcdn.com/image/fetch/$s_!mxt_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><div><hr></div><h1>&#128161;Opinion</h1><div><hr></div><h2>&#128073; The Infamous Rise of Notebook Engineers!</h2><p><strong>Daniel Beach</strong> has penned an intriguing article critiquing the growing use of notebooks in data engineering. He argues that many engineers misuse notebooks due to either a lack of technical skills or encouragement from vendors, like Databricks, promoting questionable practices. While notebooks are valuable for data analysts and scientists engaged in iterative data analysis, Daniel contends they are ill-suited for robust data engineering lifecycles. This misuse leads to poor coding standards, insufficient testing, and inadequate deployment practices. 
<strong><a href="https://dataengineeringcentral.substack.com/p/the-rise-of-the-notebook-engineer">&#8212;&gt; Read More</a></strong></p><h2>&#128073; The Analytics Personas in Business</h2><p><strong>Tristan Handy</strong>, the founder and CEO of <strong>dbt Labs</strong>, wrote an insightful piece about the key analytics personas in business. He critiques the common approach of treating the analytical process like an assembly line, which often fails to deliver significant insights and ROI. Handy suggests that the personas&#8212;primarily engineers, analysts, and decision-makers&#8212;should be seen as interchangeable "hats" that team members can wear when needed, while still maintaining their primary roles. This flexible approach fosters collaboration and enhances the overall effectiveness of the analytics process. <strong><a href="https://roundup.getdbt.com/p/analytics-personas">&#8212;&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#128225; Open Source News</h1><div><hr></div><h3>&#128073; Apache Airflow 2.10 Release</h3><p><strong>Apache Airflow 2.10</strong> was released recently, introducing exciting new features such as support for multiple executors within a single Airflow environment. This allows users to assign different executors, like <code>LocalExecutor</code> and <code>CeleryExecutor</code>, to individual DAGs and even specific tasks. There are also numerous enhancements to Datasets and the UI, which are worth exploring. <strong><a href="https://airflow.apache.org/blog/airflow-2.10.0/">&#8212;&gt; Read More</a></strong></p><h3>&#128073;  Development of a New PostgreSQL DuckDB Extension</h3><p><strong>MotherDuck</strong> announced <strong><a href="https://github.com/duckdb/pg_duckdb">pg_duckdb</a></strong>, an open-source Postgres extension that embeds the <strong>DuckDB</strong> engine into the Postgres database for running analytical queries on Postgres data. 
This is a significant step towards easily transforming a popular OLTP system into an HTAP system using an embedded OLAP engine. The development looks promising, with multiple companies such as Microsoft, Neon, and Hydra joining the effort. The beta version has been released recently.  <strong><a href="https://motherduck.com/blog/pgduckdb-beta-release-duckdb-postgres/">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Ibis's Default Backend Change</h3><p>The <strong>Ibis</strong> DataFrame library project announced that it will drop the <strong>Pandas</strong> and <strong>Dask</strong> backends in favour of making <strong>DuckDB</strong> its default backend. This decision is due to DuckDB's ease of installation, impressive speed, and strong support within the Python ecosystem. <strong><a href="https://ibis-project.org/posts/farewell-pandas/">&#8212;&gt; Read More</a></strong></p><p></p><div><hr></div><h1> &#128736; Practical Data Engineering</h1><div><hr></div><h3>&#128073;   Apache Airflow Alerting Techniques</h3><p>The Google Data Analytics blog has provided a comprehensive overview of the alerting hierarchy in <strong>Apache Airflow</strong>, ranging from the top DAG level down to the individual task instance level. It details various alerting mechanisms that can be used to monitor the state of DAG runs and receive notifications about potential failures. The alerting techniques discussed are applicable not only to Google's Cloud Composer managed Airflow service but also to other Airflow deployments. <strong><a href="https://cloud.google.com/blog/products/data-analytics/apache-airflow-hierarchy-and-alerting-options-with-cloud-composer/">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Best Practices for Optimising Airflow</h3><p>The AWS blog has covered comprehensive strategies for optimising cost and performance in its Apache Airflow managed service, <strong>Amazon MWAA</strong>. 
Right-sizing remains crucial for achieving a balanced price-performance ratio in managed services. Amazon MWAA also supports auto-scaling, which can aid in this optimisation. The blog offers additional useful techniques for optimising DAG code to ensure that DAGs remain healthy, efficient, and scalable. These techniques can be applied to any Airflow deployment setup. <strong><a href="https://aws.amazon.com/blogs/big-data/optimize-cost-and-performance-for-amazon-mwaa/">&#8212;&gt; Read More</a></strong></p><p>Speaking of Airflow&#8230;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZzhX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZzhX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 424w, https://substackcdn.com/image/fetch/$s_!ZzhX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 848w, https://substackcdn.com/image/fetch/$s_!ZzhX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 1272w, https://substackcdn.com/image/fetch/$s_!ZzhX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ZzhX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png" width="500" height="541" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:541,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18654,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZzhX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 424w, https://substackcdn.com/image/fetch/$s_!ZzhX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 848w, https://substackcdn.com/image/fetch/$s_!ZzhX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 1272w, https://substackcdn.com/image/fetch/$s_!ZzhX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><div><hr></div><h1>&#9881;&#65039; Technical Deep Dive</h1><div><hr></div><h3>&#128073; History and Evolution of Block Storage Services at AWS</h3><p>Another fascinating story on the <em><strong>All Things Distributed</strong></em> blog explores the evolution of block storage services offered by Amazon Web Services (AWS). Written by one of AWS's leading engineers, it highlights key milestones in the development of <strong>Elastic Block Store (EBS)</strong>, showcasing enhancements in performance, scalability, and continuous innovation. 
<strong><a href="https://www.allthingsdistributed.com/2024/08/continuous-reinvention-a-brief-history-of-block-storage-at-aws.html">&#8212;&gt; Read More</a></strong></p><h3>&#128073; The Internals of Apache Parquet</h3><p><strong>Vu</strong> has authored an insightful article on the internals of <strong>Parquet</strong>, the most popular columnar file format for data lakes. For those working with data lakes, understanding the design and architecture of common serialisation formats is invaluable for optimising storage and queries. <strong><a href="https://blog.det.life/i-spent-8-hours-learning-parquet-heres-what-i-discovered-97add13fb28f">&#8212;&gt; Read More</a></strong></p><h3>&#128073; The Future of Distributed Systems and Their Storage Backend</h3><p>A great article by <strong>Colin Breck</strong> on the future of distributed systems, highlighting current challenges and major trends such as the accelerating adoption of object stores as the main storage backend for many analytical and transactional database systems. <strong><a href="https://blog.colinbreck.com/predicting-the-future-of-distributed-systems/">&#8212;&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#128170; Skill Up</h1><div><hr></div><p>In a recent <strong><a href="https://www.linkedin.com/posts/alirezasadeghi_weekend-caffeinated-insights-book-recommendations-activity-7255889495051419648-Oxvm?utm_source=share">LinkedIn post</a></strong>, I shared my top book recommendations for learning data engineering fundamentals. The feedback was overwhelmingly positive, with many comments emphasising the importance of selecting a good book and focusing on mastering the fundamentals. 
I have personally read all these books in recent years and have gained a lot from them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T49x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T49x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 424w, https://substackcdn.com/image/fetch/$s_!T49x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 848w, https://substackcdn.com/image/fetch/$s_!T49x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 1272w, https://substackcdn.com/image/fetch/$s_!T49x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T49x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png" width="1036" height="1578" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/834a0814-413e-412a-a13c-e0d272134026_1036x1578.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1578,&quot;width&quot;:1036,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1121601,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!T49x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 424w, https://substackcdn.com/image/fetch/$s_!T49x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 848w, https://substackcdn.com/image/fetch/$s_!T49x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 1272w, https://substackcdn.com/image/fetch/$s_!T49x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>&#128073;  DataCamp&#8217;s Free Week</h3><p><strong>DataCamp</strong> is offering free access to its entire platform and all courses for a week, from November 4 to 10. This is a great opportunity to explore their courses and enhance your skills in the coming week! <strong><a href="https://www.datacamp.com/blog/datacamp-free-access-week">&#8212;&gt; Read More</a></strong></p><p></p><div><hr></div><h1> &#128270; Case Studies</h1><div><hr></div><h3>&#128073; Cost-effective Data Analytics Using Deterministic Sampling</h3><p><strong>Meta</strong> has shared valuable insights into its approach for achieving cost-effective data analytics through the use of deterministic sampling. This strategy is designed to balance the cost versus value trade-off, especially as data volumes and computation costs continue to rise exponentially. By employing deterministic sampling, Meta aims to reduce the overall cost and complexity of analytics without compromising the quality of insights. 
<strong><a href="https://medium.com/@AnalyticsAtMeta/scaling-analytics-instagram-the-power-of-deterministic-sampling-8ee7332d77ae">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Uber's New Declarative Batch ETL Framework</h3><p><strong>Uber</strong> has developed a modular declarative batch ETL framework called <em><strong>Sparkle</strong></em>, which leverages Apache Spark as the compute engine. Sparkle simplifies and standardises ETL pipeline development by allowing users to focus on expressing business logic as a sequence of transformation modules in SQL or Java/Scala/Python. It includes embedded unit testing and marks Uber's transition of all its batch ETL pipelines from Hive to Spark in 2023. <strong><a href="https://www.uber.com/blog/sparkle-modular-etl/">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Self-service Kafka Platform Development Journey</h3><p><strong>DoorDash</strong> has shared its journey in developing a self-service Kafka platform, aimed at addressing the challenges of managing Kafka infrastructure efficiently. This initiative was driven by the need to simplify the management of Kafka topics and resources, which was previously hindered by the use of low-level configuration management tools like Terraform. <strong><a href="https://careers.doordash.com/blog/doordash-engineers-with-kafka-self-serve/">&#8212;&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#128227; Vendor News &amp; Announcements</h1><div><hr></div><h3>&#128073; Continuous Queries on Data Warehouse Systems</h3><p><strong>Google BigQuery</strong> has introduced a significant new feature called <em><strong>BigQuery continuous queries</strong></em>, currently available in Preview. This feature transforms BigQuery from a batch system into an event-driven streaming pipeline, leveraging the concept of <strong>Stream-Table Duality</strong>. 
It allows for the continuous ingestion of new events as data is loaded into BigQuery, enabling use cases like event-driven data processing, continuous record replication to a pub/sub queue or other streaming storage systems, real-time ML model integration, and Reverse ETL use cases. <strong><a href="https://cloud.google.com/blog/products/data-analytics/bigquery-continuous-queries-makes-data-analysis-real-time/">&#8212;&gt; Read More</a></strong></p><p>The Confluent blog has also <a href="https://www.confluent.io/blog/streaming-bigquery-data-into-confluent-with-continuous-queries/">published</a> an article on leveraging this feature to stream data from BigQuery into the Confluent platform.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S1ZP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S1ZP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 424w, https://substackcdn.com/image/fetch/$s_!S1ZP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 848w, https://substackcdn.com/image/fetch/$s_!S1ZP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 1272w, 
https://substackcdn.com/image/fetch/$s_!S1ZP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S1ZP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png" width="1456" height="566" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;BQ-image1&quot;,&quot;title&quot;:&quot;BQ-image1&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="BQ-image1" title="BQ-image1" srcset="https://substackcdn.com/image/fetch/$s_!S1ZP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 424w, https://substackcdn.com/image/fetch/$s_!S1ZP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 848w, https://substackcdn.com/image/fetch/$s_!S1ZP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 1272w, 
https://substackcdn.com/image/fetch/$s_!S1ZP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>&#128073; New Google Managed Apache Kafka Service</h3><p>Google has also announced a new managed service, Google Cloud Managed Service for <strong>Apache Kafka</strong>. 
This service abstracts the complexities of deploying and managing a Kafka cluster, offering features like security management, full management of brokers and storage, and automatic horizontal and vertical scaling. It also includes automatic storage tiering and lifecycle management to offload cold data to unlimited cloud storage. <strong><a href="https://cloud.google.com/blog/products/data-analytics/new-managed-service-for-apache-kafka/">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Introduction of Conditional Writes on AWS S3</h3><p>AWS recently announced "<em><strong>Conditional Writes</strong></em>" on <strong>S3</strong>, a notable step towards more reliable and efficient data operations, especially for distributed applications. With this feature, a write succeeds only if a stated condition is met, reducing the risk of unintentional data overwrites. This lets multiple clients write to the same key without silently overwriting each other's data. <strong><a href="https://www.infoq.com/news/2024/08/amazon-s3-conditional-writes/">&#8212;&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#127909; Conferences &amp; Events</h1><div><hr></div><h3>&#128073; Thinking Like an Architect</h3><p><strong>Gregor Hohpe</strong> delivered a presentation titled "Thinking Like an Architect" at <strong>QCon London 2024</strong>. If you're interested in or work with data architecture, his talk offers valuable insights and is well worth watching. <strong><a href="https://www.infoq.com/presentations/architect-lessons/">&#8212;&gt; Watch</a></strong></p><h3>&#128073; Carnegie Mellon University's Intro to Database Systems Course - Fall 2024</h3><p>The Fall 2024 session of Carnegie Mellon University's renowned "<strong>Intro to Database Systems</strong>" course commenced in August. 
You can follow along with the course through recorded lectures available on their <strong><a href="https://www.youtube.com/playlist?list=PLSE8ODhjZXjYDBpQnSymaectKjxCy6BYq">YouTube channel</a>.</strong></p><p></p>]]></content:encoded></item><item><title><![CDATA[Building a High-Performance Data Pipeline Using DuckDB]]></title><description><![CDATA[Using DuckDB to Serialise, Transform, and Aggregate Data in Data Lakes]]></description><link>https://www.pracdata.io/p/building-data-pipeline-using-duckdb</link><guid isPermaLink="false">https://www.pracdata.io/p/building-data-pipeline-using-duckdb</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 13 Oct 2024 06:06:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88b4b34-f73e-436c-bf2f-5659eb9126eb_2067x1141.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>DuckDB, a high-performance, embeddable analytical engine, has been generating a lot of interest due to its lightweight setup and powerful capabilities. 
</p><p>In my previous article, <strong><a href="https://practicaldataengineering.substack.com/p/duckdb-beyond-the-hype">DuckDB Beyond the Hype</a></strong>, I explored its various use cases and briefly demonstrated how it can be used in data engineering and data science workflows.</p><p>One use case that particularly resonated with readers was using DuckDB for data transformation and serialisation on data lakes. Inspired by readers' feedback, I decided to write this follow-up article to dive deeper into this use case and provide a full code example.</p><p>In this article, I present a high-level use case with sample code, demonstrating how to move data between different zones in a data lake, using DuckDB as the compute engine.</p><p>To keep the focus on the core concepts, only brief code snippets are included, but the full implementation can be found on <strong><a href="https://github.com/pracdata/duckdb-pipeline">GitHub</a></strong> for those who want to explore further.</p><h1>Project Overview</h1><p>For simplicity, I implemented the project as a pure Python application, with minimal dependencies. The only external dependency is a cloud object store for our data lake implementation, which doesn&#8217;t have to be AWS S3 specifically; any S3-compatible object store will do. Alternatively, you can use <a href="https://github.com/localstack/localstack">LocalStack</a> to emulate the data lake locally on your machine.</p><p>The use case we&#8217;ll be exploring involves incrementally collecting the GitHub Archive datasets, which provide a full record of GitHub activities for public repositories, and enabling analytics on top of that data.</p><p></p><h1>Data Lake Architecture</h1><p>We will use a multi-tier architecture, also referred to as the Medallion architecture, for this exercise. 
The Medallion architecture is a data lake design pattern that organises data into three zones:</p><ul><li><p><strong>Bronze Zone</strong>: Containing raw, unprocessed data ingested from various sources. </p></li><li><p><strong>Silver Zone</strong>:  Containing cleaned, conformed and potentially modeled data.</p></li><li><p><strong>Gold Zone</strong>: Containing aggregated and curated data ready for reporting, dashboards, and advanced analytics.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!la72!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc64f5b4f-1a1e-4df9-95d6-18e46f6532df_1127x766.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!la72!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc64f5b4f-1a1e-4df9-95d6-18e46f6532df_1127x766.png 424w, https://substackcdn.com/image/fetch/$s_!la72!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc64f5b4f-1a1e-4df9-95d6-18e46f6532df_1127x766.png 848w, https://substackcdn.com/image/fetch/$s_!la72!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc64f5b4f-1a1e-4df9-95d6-18e46f6532df_1127x766.png 1272w, https://substackcdn.com/image/fetch/$s_!la72!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc64f5b4f-1a1e-4df9-95d6-18e46f6532df_1127x766.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!la72!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc64f5b4f-1a1e-4df9-95d6-18e46f6532df_1127x766.png" width="1127" height="766" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c64f5b4f-1a1e-4df9-95d6-18e46f6532df_1127x766.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:766,&quot;width&quot;:1127,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:155878,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!la72!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc64f5b4f-1a1e-4df9-95d6-18e46f6532df_1127x766.png 424w, https://substackcdn.com/image/fetch/$s_!la72!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc64f5b4f-1a1e-4df9-95d6-18e46f6532df_1127x766.png 848w, https://substackcdn.com/image/fetch/$s_!la72!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc64f5b4f-1a1e-4df9-95d6-18e46f6532df_1127x766.png 1272w, https://substackcdn.com/image/fetch/$s_!la72!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc64f5b4f-1a1e-4df9-95d6-18e46f6532df_1127x766.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Data Lake Architecture</figcaption></figure></div><p>By maintaining this multi-zone architecture (<strong>Bronze &#8594; Silver &#8594; Gold</strong>), we ensure access to data at various stages of processing, ranging from raw data for detailed analysis to aggregated data for quick insights. This flexibility allows us to meet a wide range of analytical needs while optimising for both storage and query performance.</p><h2>Partitioning Scheme</h2><p>Each zone will use an appropriate partitioning scheme to optimise data ingestion and query performance and to improve overall efficiency.</p><p>Typically, the partitioning scheme aligns with the data ingestion and transformation frequency for batch workloads. 
This approach ensures that the pipeline remains deterministic at the partition level.</p><p>In other words, there are no overlaps between individual job runs, and if a pipeline fails or an anomaly is detected in the data, we can safely rerun the job for that specific period. The pipeline will replace the affected partition and its contents, avoiding any negative side effects, data duplication or inconsistencies.</p><p>This approach is often referred to as <strong>functional data processing using immutable partitions</strong>. It was popularised by Maxime Beauchemin, who introduced this powerful data engineering pattern in a highly regarded <strong><a href="https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a">blog post</a></strong> back in 2018.</p><p>Following the functional data processing pattern, we&#8217;ll ingest data hourly from the source and partition it by both day and hour in the Bronze Zone. Each hourly job writes atomically to its designated partition, ensuring data integrity at this level of granularity.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wKBd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fac608-9805-4e3e-b0b4-ee45eea76da7_1283x797.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wKBd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fac608-9805-4e3e-b0b4-ee45eea76da7_1283x797.png 424w, https://substackcdn.com/image/fetch/$s_!wKBd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fac608-9805-4e3e-b0b4-ee45eea76da7_1283x797.png 848w, 
https://substackcdn.com/image/fetch/$s_!wKBd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fac608-9805-4e3e-b0b4-ee45eea76da7_1283x797.png 1272w, https://substackcdn.com/image/fetch/$s_!wKBd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fac608-9805-4e3e-b0b4-ee45eea76da7_1283x797.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wKBd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fac608-9805-4e3e-b0b4-ee45eea76da7_1283x797.png" width="1283" height="797" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39fac608-9805-4e3e-b0b4-ee45eea76da7_1283x797.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:797,&quot;width&quot;:1283,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:111533,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wKBd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fac608-9805-4e3e-b0b4-ee45eea76da7_1283x797.png 424w, https://substackcdn.com/image/fetch/$s_!wKBd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fac608-9805-4e3e-b0b4-ee45eea76da7_1283x797.png 848w, 
https://substackcdn.com/image/fetch/$s_!wKBd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fac608-9805-4e3e-b0b4-ee45eea76da7_1283x797.png 1272w, https://substackcdn.com/image/fetch/$s_!wKBd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fac608-9805-4e3e-b0b4-ee45eea76da7_1283x797.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>We&#8217;ll apply the same partitioning strategy in the Silver Zone, since the transformation from Bronze to Silver will also 
run on an hourly basis.</p><p>In the Gold Zone, partitioning will be done only by day, since the aggregation job runs at a daily interval. The daily job will produce results for the previous day's data from the Silver Zone.</p><p>This approach strikes the right balance between performance and storage&#8212;using fine-grained partitions in the early stages and higher-level aggregation in the final layer.</p><p>The overall partitioning scheme for the three zones is illustrated in the figure below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ggGm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9f54a0f-2ef4-4e65-a60d-99ca5a228482_1358x679.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ggGm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9f54a0f-2ef4-4e65-a60d-99ca5a228482_1358x679.png 424w, https://substackcdn.com/image/fetch/$s_!ggGm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9f54a0f-2ef4-4e65-a60d-99ca5a228482_1358x679.png 848w, https://substackcdn.com/image/fetch/$s_!ggGm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9f54a0f-2ef4-4e65-a60d-99ca5a228482_1358x679.png 1272w, https://substackcdn.com/image/fetch/$s_!ggGm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9f54a0f-2ef4-4e65-a60d-99ca5a228482_1358x679.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ggGm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9f54a0f-2ef4-4e65-a60d-99ca5a228482_1358x679.png" width="1358" height="679" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9f54a0f-2ef4-4e65-a60d-99ca5a228482_1358x679.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:679,&quot;width&quot;:1358,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:173839,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ggGm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9f54a0f-2ef4-4e65-a60d-99ca5a228482_1358x679.png 424w, https://substackcdn.com/image/fetch/$s_!ggGm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9f54a0f-2ef4-4e65-a60d-99ca5a228482_1358x679.png 848w, https://substackcdn.com/image/fetch/$s_!ggGm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9f54a0f-2ef4-4e65-a60d-99ca5a228482_1358x679.png 1272w, https://substackcdn.com/image/fetch/$s_!ggGm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9f54a0f-2ef4-4e65-a60d-99ca5a228482_1358x679.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Data Lake Partitioning Scheme</figcaption></figure></div><p></p><h1>Data Pipeline Architecture</h1><p>The end-to-end data pipeline is broken down into three key steps:</p><p><strong>Step #1</strong> - Ingest the hourly <strong>GitHub Archive</strong> batch dataset from the <a href="http://gharchive.org">gharchive.org</a> website over HTTP and load it into the Bronze Zone of our data lake.</p><p><strong>Step #2</strong> - Run an hourly transformation pipeline to clean and serialise the JSON files and load the results into the Silver Zone.</p><p><strong>Step #3</strong> - Run a daily transformation pipeline to aggregate the data from the previous day and store it in the Gold Zone.</p><div class="captioned-image-container"><figure><a 
class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6-6X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88b4b34-f73e-436c-bf2f-5659eb9126eb_2067x1141.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6-6X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88b4b34-f73e-436c-bf2f-5659eb9126eb_2067x1141.png 424w, https://substackcdn.com/image/fetch/$s_!6-6X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88b4b34-f73e-436c-bf2f-5659eb9126eb_2067x1141.png 848w, https://substackcdn.com/image/fetch/$s_!6-6X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88b4b34-f73e-436c-bf2f-5659eb9126eb_2067x1141.png 1272w, https://substackcdn.com/image/fetch/$s_!6-6X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88b4b34-f73e-436c-bf2f-5659eb9126eb_2067x1141.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6-6X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88b4b34-f73e-436c-bf2f-5659eb9126eb_2067x1141.png" width="1456" height="804" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d88b4b34-f73e-436c-bf2f-5659eb9126eb_2067x1141.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316888,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6-6X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88b4b34-f73e-436c-bf2f-5659eb9126eb_2067x1141.png 424w, https://substackcdn.com/image/fetch/$s_!6-6X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88b4b34-f73e-436c-bf2f-5659eb9126eb_2067x1141.png 848w, https://substackcdn.com/image/fetch/$s_!6-6X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88b4b34-f73e-436c-bf2f-5659eb9126eb_2067x1141.png 1272w, https://substackcdn.com/image/fetch/$s_!6-6X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88b4b34-f73e-436c-bf2f-5659eb9126eb_2067x1141.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Data Pipeline Architecture</figcaption></figure></div><p></p><h1>Data Source Exploration</h1><p>Before we jump into building our data pipeline, we should explore and learn about the data source, its characteristics and the shape of data.</p><p>A great way for data exploration is using Jupyter. DuckDB can help with exploring the remote data without having to download the entire dataset locally.</p><p>The source data are available from <a href="http://gharchive.org">gharchive.org</a> and the data dumps are made available on different intervals such as hourly and daily. </p><p>The following example demonstrates how to analyse a sample <em>gharchive</em> dump file using DuckDB. 
With DuckDB, you can define a <em>virtual</em> table directly over a URL, enabling you to query and analyse the data without the need to download the file locally first.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PqcG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37093d7f-2f95-45fe-a02f-6f5a0ba7aa3d_1250x715.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PqcG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37093d7f-2f95-45fe-a02f-6f5a0ba7aa3d_1250x715.png 424w, https://substackcdn.com/image/fetch/$s_!PqcG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37093d7f-2f95-45fe-a02f-6f5a0ba7aa3d_1250x715.png 848w, https://substackcdn.com/image/fetch/$s_!PqcG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37093d7f-2f95-45fe-a02f-6f5a0ba7aa3d_1250x715.png 1272w, https://substackcdn.com/image/fetch/$s_!PqcG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37093d7f-2f95-45fe-a02f-6f5a0ba7aa3d_1250x715.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PqcG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37093d7f-2f95-45fe-a02f-6f5a0ba7aa3d_1250x715.png" width="1250" height="715" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37093d7f-2f95-45fe-a02f-6f5a0ba7aa3d_1250x715.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:715,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183680,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PqcG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37093d7f-2f95-45fe-a02f-6f5a0ba7aa3d_1250x715.png 424w, https://substackcdn.com/image/fetch/$s_!PqcG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37093d7f-2f95-45fe-a02f-6f5a0ba7aa3d_1250x715.png 848w, https://substackcdn.com/image/fetch/$s_!PqcG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37093d7f-2f95-45fe-a02f-6f5a0ba7aa3d_1250x715.png 1272w, https://substackcdn.com/image/fetch/$s_!PqcG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37093d7f-2f95-45fe-a02f-6f5a0ba7aa3d_1250x715.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h1>Data Ingestion - &#129353; Bronze Zone</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dT1t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b9c8b4f-7fb5-44d2-a6cb-64e8524cd03c_1658x1141.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dT1t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b9c8b4f-7fb5-44d2-a6cb-64e8524cd03c_1658x1141.png 424w, https://substackcdn.com/image/fetch/$s_!dT1t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b9c8b4f-7fb5-44d2-a6cb-64e8524cd03c_1658x1141.png 848w, 
https://substackcdn.com/image/fetch/$s_!dT1t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b9c8b4f-7fb5-44d2-a6cb-64e8524cd03c_1658x1141.png 1272w, https://substackcdn.com/image/fetch/$s_!dT1t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b9c8b4f-7fb5-44d2-a6cb-64e8524cd03c_1658x1141.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dT1t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b9c8b4f-7fb5-44d2-a6cb-64e8524cd03c_1658x1141.png" width="1456" height="1002" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b9c8b4f-7fb5-44d2-a6cb-64e8524cd03c_1658x1141.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1002,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:226888,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dT1t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b9c8b4f-7fb5-44d2-a6cb-64e8524cd03c_1658x1141.png 424w, https://substackcdn.com/image/fetch/$s_!dT1t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b9c8b4f-7fb5-44d2-a6cb-64e8524cd03c_1658x1141.png 848w, 
https://substackcdn.com/image/fetch/$s_!dT1t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b9c8b4f-7fb5-44d2-a6cb-64e8524cd03c_1658x1141.png 1272w, https://substackcdn.com/image/fetch/$s_!dT1t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b9c8b4f-7fb5-44d2-a6cb-64e8524cd03c_1658x1141.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Step 1 - Data Ingestion</figcaption></figure></div><p></p><p>To implement our data ingestion pipeline, we need to 
identify the interface, protocol and data type of the source system, as this will guide our approach. Here's the breakdown for this use case:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P1FB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f4a6dc6-3db0-4706-8105-04f0b2c6a132_924x398.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P1FB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f4a6dc6-3db0-4706-8105-04f0b2c6a132_924x398.png 424w, https://substackcdn.com/image/fetch/$s_!P1FB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f4a6dc6-3db0-4706-8105-04f0b2c6a132_924x398.png 848w, https://substackcdn.com/image/fetch/$s_!P1FB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f4a6dc6-3db0-4706-8105-04f0b2c6a132_924x398.png 1272w, https://substackcdn.com/image/fetch/$s_!P1FB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f4a6dc6-3db0-4706-8105-04f0b2c6a132_924x398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P1FB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f4a6dc6-3db0-4706-8105-04f0b2c6a132_924x398.png" width="924" height="398" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f4a6dc6-3db0-4706-8105-04f0b2c6a132_924x398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:398,&quot;width&quot;:924,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48383,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P1FB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f4a6dc6-3db0-4706-8105-04f0b2c6a132_924x398.png 424w, https://substackcdn.com/image/fetch/$s_!P1FB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f4a6dc6-3db0-4706-8105-04f0b2c6a132_924x398.png 848w, https://substackcdn.com/image/fetch/$s_!P1FB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f4a6dc6-3db0-4706-8105-04f0b2c6a132_924x398.png 1272w, https://substackcdn.com/image/fetch/$s_!P1FB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f4a6dc6-3db0-4706-8105-04f0b2c6a132_924x398.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Based on the above requirements, we need to download the hourly compressed JSON files from the <em><strong>gharchive.org</strong></em> server over HTTP. 
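</p><p>As a rough sketch of how the hourly source file and its landing location could be derived for a given hour: the URL pattern follows GH Archive's published file naming (<em>YYYY-MM-DD-H.json.gz</em>, with the hour not zero-padded), while the function names, the <em>bronze/gharchive</em> prefix and the <em>dt=/hour=</em> key layout are hypothetical choices made for illustration, not the layout used in this project.</p>

```python
from datetime import datetime, timedelta, timezone

# Public GH Archive endpoint; hourly files are named YYYY-MM-DD-H.json.gz,
# where the hour is 0-23 without zero padding.
GHARCHIVE_BASE_URL = "https://data.gharchive.org"


def hourly_dump_url(hour: datetime) -> str:
    """Return the URL of the GH Archive dump covering the given UTC hour."""
    return f"{GHARCHIVE_BASE_URL}/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def bronze_object_key(hour: datetime, prefix: str = "bronze/gharchive") -> str:
    """Return a hypothetical hour-partitioned object key for the Bronze Zone."""
    filename = f"{hour:%Y-%m-%d}-{hour.hour}.json.gz"
    return f"{prefix}/dt={hour:%Y-%m-%d}/hour={hour:%H}/{filename}"


def previous_full_hour(now: datetime) -> datetime:
    """Return the start of the last completed hour before `now`."""
    return now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)


if __name__ == "__main__":
    # An hourly job would normally target the most recently completed hour.
    target = previous_full_hour(datetime.now(timezone.utc))
    print(hourly_dump_url(target))
    print(bronze_object_key(target))
```

<p>Keeping the URL and key construction in small pure functions means the same helpers can be reused to backfill any past hour, independently of how the file is then downloaded and uploaded.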
</p><p>A straightforward approach in Python is to use the <em><strong>requests</strong></em> library to stream and buffer the file from the source, followed by the <em><strong>boto3</strong></em> library to upload the file to S3, publishing it to the Bronze Zone of the data lake.</p><p>The following is a simple example that collects and publishes the data:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rtLH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe5adeb-b146-418b-b332-e6802742a954_1422x358.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rtLH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe5adeb-b146-418b-b332-e6802742a954_1422x358.png 424w, https://substackcdn.com/image/fetch/$s_!rtLH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe5adeb-b146-418b-b332-e6802742a954_1422x358.png 848w, https://substackcdn.com/image/fetch/$s_!rtLH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe5adeb-b146-418b-b332-e6802742a954_1422x358.png 1272w, https://substackcdn.com/image/fetch/$s_!rtLH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe5adeb-b146-418b-b332-e6802742a954_1422x358.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rtLH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe5adeb-b146-418b-b332-e6802742a954_1422x358.png" width="1422" height="358" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fe5adeb-b146-418b-b332-e6802742a954_1422x358.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:358,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58573,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rtLH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe5adeb-b146-418b-b332-e6802742a954_1422x358.png 424w, https://substackcdn.com/image/fetch/$s_!rtLH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe5adeb-b146-418b-b332-e6802742a954_1422x358.png 848w, https://substackcdn.com/image/fetch/$s_!rtLH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe5adeb-b146-418b-b332-e6802742a954_1422x358.png 1272w, https://substackcdn.com/image/fetch/$s_!rtLH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe5adeb-b146-418b-b332-e6802742a954_1422x358.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Data upload to S3:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lSR-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdda0c0bf-725c-4fa9-a217-d701bf36c44d_1422x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lSR-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdda0c0bf-725c-4fa9-a217-d701bf36c44d_1422x570.png 424w, https://substackcdn.com/image/fetch/$s_!lSR-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdda0c0bf-725c-4fa9-a217-d701bf36c44d_1422x570.png 848w, 
https://substackcdn.com/image/fetch/$s_!lSR-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdda0c0bf-725c-4fa9-a217-d701bf36c44d_1422x570.png 1272w, https://substackcdn.com/image/fetch/$s_!lSR-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdda0c0bf-725c-4fa9-a217-d701bf36c44d_1422x570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lSR-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdda0c0bf-725c-4fa9-a217-d701bf36c44d_1422x570.png" width="1422" height="570" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dda0c0bf-725c-4fa9-a217-d701bf36c44d_1422x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:570,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:109122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lSR-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdda0c0bf-725c-4fa9-a217-d701bf36c44d_1422x570.png 424w, https://substackcdn.com/image/fetch/$s_!lSR-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdda0c0bf-725c-4fa9-a217-d701bf36c44d_1422x570.png 848w, 
https://substackcdn.com/image/fetch/$s_!lSR-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdda0c0bf-725c-4fa9-a217-d701bf36c44d_1422x570.png 1272w, https://substackcdn.com/image/fetch/$s_!lSR-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdda0c0bf-725c-4fa9-a217-d701bf36c44d_1422x570.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>While this example is suitable for testing, it's a basic implementation. 
To build a robust data pipeline, we need to incorporate better parametrisation, error handling, and modularity.</p><p>For this, I&#8217;ve written a class in <code>data_lake_ingestor.py</code> that fetches data from the GitHub Archive for a specific hour. It uses Python's <em><strong>requests</strong></em> library to download the compressed JSON file into memory. The data is then uploaded directly to the specified S3 bucket, with the S3 key derived from the date and hour.</p><p>To execute the ingestion pipeline, all we need to do is pass a timestamp to the ingestion method <code>ingest_hourly_gharchive()</code>, which uses it to determine which hourly JSON file to collect from the source and load.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K4oJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07edff3-b877-4648-ba44-ec4f99c74c64_1442x546.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K4oJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07edff3-b877-4648-ba44-ec4f99c74c64_1442x546.png 424w, https://substackcdn.com/image/fetch/$s_!K4oJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07edff3-b877-4648-ba44-ec4f99c74c64_1442x546.png 848w, https://substackcdn.com/image/fetch/$s_!K4oJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07edff3-b877-4648-ba44-ec4f99c74c64_1442x546.png 1272w, 
https://substackcdn.com/image/fetch/$s_!K4oJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07edff3-b877-4648-ba44-ec4f99c74c64_1442x546.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K4oJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07edff3-b877-4648-ba44-ec4f99c74c64_1442x546.png" width="1442" height="546" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d07edff3-b877-4648-ba44-ec4f99c74c64_1442x546.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:1442,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:110985,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K4oJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07edff3-b877-4648-ba44-ec4f99c74c64_1442x546.png 424w, https://substackcdn.com/image/fetch/$s_!K4oJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07edff3-b877-4648-ba44-ec4f99c74c64_1442x546.png 848w, https://substackcdn.com/image/fetch/$s_!K4oJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07edff3-b877-4648-ba44-ec4f99c74c64_1442x546.png 1272w, 
https://substackcdn.com/image/fetch/$s_!K4oJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07edff3-b877-4648-ba44-ec4f99c74c64_1442x546.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Managing Configurations and Secrets</h3><p>Configuration settings, such as AWS credentials and bucket names, are stored in a configuration file (<em>config.ini</em>), to keep the code free from any sensitive and static data.</p><pre><code><code>[aws]
s3_access_key_id = your_access_key_here
s3_secret_access_key = your_secret_key_here
s3_region_name = your_region_name_here
s3_endpoint_url = your_custom_endpoint_url_here

[datalake]
bronze_bucket = your_bronze_zone_bucket
silver_bucket = your_silver_zone_bucket
gold_bucket = your_gold_zone_bucket</code></code></pre><p>In the code, we utilise Python's <code>configparser</code> library to load these configurations into the class, as demonstrated in the private method below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bb0U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0999cc1-31cf-42c1-bce1-de5475751945_1432x336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bb0U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0999cc1-31cf-42c1-bce1-de5475751945_1432x336.png 424w, https://substackcdn.com/image/fetch/$s_!Bb0U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0999cc1-31cf-42c1-bce1-de5475751945_1432x336.png 848w, https://substackcdn.com/image/fetch/$s_!Bb0U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0999cc1-31cf-42c1-bce1-de5475751945_1432x336.png 1272w, https://substackcdn.com/image/fetch/$s_!Bb0U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0999cc1-31cf-42c1-bce1-de5475751945_1432x336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bb0U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0999cc1-31cf-42c1-bce1-de5475751945_1432x336.png" width="1432" height="336" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0999cc1-31cf-42c1-bce1-de5475751945_1432x336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:336,&quot;width&quot;:1432,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61658,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bb0U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0999cc1-31cf-42c1-bce1-de5475751945_1432x336.png 424w, https://substackcdn.com/image/fetch/$s_!Bb0U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0999cc1-31cf-42c1-bce1-de5475751945_1432x336.png 848w, https://substackcdn.com/image/fetch/$s_!Bb0U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0999cc1-31cf-42c1-bce1-de5475751945_1432x336.png 1272w, https://substackcdn.com/image/fetch/$s_!Bb0U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0999cc1-31cf-42c1-bce1-de5475751945_1432x336.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><h1>Raw Data Serialisation - &#129352; Silver Zone</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!XxGa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac44546-62b2-4e86-8b0c-c0d09d073e21_1658x1140.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XxGa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac44546-62b2-4e86-8b0c-c0d09d073e21_1658x1140.png 424w, https://substackcdn.com/image/fetch/$s_!XxGa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac44546-62b2-4e86-8b0c-c0d09d073e21_1658x1140.png 848w, https://substackcdn.com/image/fetch/$s_!XxGa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac44546-62b2-4e86-8b0c-c0d09d073e21_1658x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!XxGa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac44546-62b2-4e86-8b0c-c0d09d073e21_1658x1140.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XxGa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac44546-62b2-4e86-8b0c-c0d09d073e21_1658x1140.png" width="1456" height="1001" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ac44546-62b2-4e86-8b0c-c0d09d073e21_1658x1140.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1001,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:235882,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XxGa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac44546-62b2-4e86-8b0c-c0d09d073e21_1658x1140.png 424w, https://substackcdn.com/image/fetch/$s_!XxGa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac44546-62b2-4e86-8b0c-c0d09d073e21_1658x1140.png 848w, https://substackcdn.com/image/fetch/$s_!XxGa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac44546-62b2-4e86-8b0c-c0d09d073e21_1658x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!XxGa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac44546-62b2-4e86-8b0c-c0d09d073e21_1658x1140.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Step 2 - Data Serialisation</figcaption></figure></div><p></p><p>After ingesting the raw GitHub Archive data into our data lake&#8217;s Bronze layer, the next critical step in the pipeline is to clean and serialise this data, preparing it for the Silver layer. This is where DuckDB plays a key role, performing the necessary transformations within the data lake.</p><p>The transformation logic is encapsulated in the <code>DataLakeTransformer</code> class, located in <em>data_lake_transformer.py</em>. 
This class provides two primary methods: <code>serialise_raw_data()</code> for data cleaning and serialisation, and <code>aggregate_silver_data()</code> for aggregating the data.</p><p>Let&#8217;s take a closer look at the serialisation logic:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x9vy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bf85ce-f0df-4dcd-97ba-87d843b41c28_1432x542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x9vy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bf85ce-f0df-4dcd-97ba-87d843b41c28_1432x542.png 424w, https://substackcdn.com/image/fetch/$s_!x9vy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bf85ce-f0df-4dcd-97ba-87d843b41c28_1432x542.png 848w, https://substackcdn.com/image/fetch/$s_!x9vy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bf85ce-f0df-4dcd-97ba-87d843b41c28_1432x542.png 1272w, https://substackcdn.com/image/fetch/$s_!x9vy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bf85ce-f0df-4dcd-97ba-87d843b41c28_1432x542.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x9vy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bf85ce-f0df-4dcd-97ba-87d843b41c28_1432x542.png" width="1432" height="542" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26bf85ce-f0df-4dcd-97ba-87d843b41c28_1432x542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:542,&quot;width&quot;:1432,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108207,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x9vy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bf85ce-f0df-4dcd-97ba-87d843b41c28_1432x542.png 424w, https://substackcdn.com/image/fetch/$s_!x9vy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bf85ce-f0df-4dcd-97ba-87d843b41c28_1432x542.png 848w, https://substackcdn.com/image/fetch/$s_!x9vy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bf85ce-f0df-4dcd-97ba-87d843b41c28_1432x542.png 1272w, https://substackcdn.com/image/fetch/$s_!x9vy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bf85ce-f0df-4dcd-97ba-87d843b41c28_1432x542.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This method performs several key steps:</p><p><strong>Source and Sink Configuration:</strong> It determines the source (Bronze) and sink (Silver) bucket names based on the parameters specified in the configuration file (<em>config.ini</em>).</p><p><strong>Data Loading:</strong> The logic for importing the raw JSON data into an in-memory table using DuckDB&#8217;s <a href="https://duckdb.org/docs/api/python/relational_api.html">Relational API</a> is defined in the following method:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1CJ8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09cf23a0-5b6e-4e5f-88b7-ad0eaf9af69d_1428x294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!1CJ8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09cf23a0-5b6e-4e5f-88b7-ad0eaf9af69d_1428x294.png 424w, https://substackcdn.com/image/fetch/$s_!1CJ8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09cf23a0-5b6e-4e5f-88b7-ad0eaf9af69d_1428x294.png 848w, https://substackcdn.com/image/fetch/$s_!1CJ8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09cf23a0-5b6e-4e5f-88b7-ad0eaf9af69d_1428x294.png 1272w, https://substackcdn.com/image/fetch/$s_!1CJ8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09cf23a0-5b6e-4e5f-88b7-ad0eaf9af69d_1428x294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1CJ8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09cf23a0-5b6e-4e5f-88b7-ad0eaf9af69d_1428x294.png" width="1428" height="294" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09cf23a0-5b6e-4e5f-88b7-ad0eaf9af69d_1428x294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:294,&quot;width&quot;:1428,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76931,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!1CJ8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09cf23a0-5b6e-4e5f-88b7-ad0eaf9af69d_1428x294.png 424w, https://substackcdn.com/image/fetch/$s_!1CJ8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09cf23a0-5b6e-4e5f-88b7-ad0eaf9af69d_1428x294.png 848w, https://substackcdn.com/image/fetch/$s_!1CJ8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09cf23a0-5b6e-4e5f-88b7-ad0eaf9af69d_1428x294.png 1272w, https://substackcdn.com/image/fetch/$s_!1CJ8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09cf23a0-5b6e-4e5f-88b7-ad0eaf9af69d_1428x294.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The method returns a <code>duckdb.DuckDBPyRelation</code> object, which acts as a relational reference to the in-memory table. This ensures that subsequent steps operate on the in-memory data, avoiding repeated reads from the source file.</p><p>A key detail here is the <code>ignore_errors=true</code> parameter. DuckDB infers the schema from the first few records, and in the case of large datasets with inconsistent schemas (e.g., extra nested attributes in some records), errors may occur. </p><p>By setting <code>ignore_errors=true</code>, DuckDB skips over records that don&#8217;t match the inferred schema, which is efficient for our use case, where we don&#8217;t need deep, optional attributes found in some records. 
Alternatively, we could scan more records or provide an explicit schema, but that would introduce significant overhead for the large files we are processing.</p><h3>Data Modeling</h3><p>Before serialising the raw data into Parquet format, we need to do some data modeling and select only the attributes we are interested in. We can implement this logic in DuckDB SQL, as shown in the following method.</p><p>The result of the SQL query is also stored in an in-memory table, so that subsequent calls do not re-execute the SQL logic.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_uNJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f776-ef66-4a00-969b-e0df4580cf3b_1430x840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_uNJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f776-ef66-4a00-969b-e0df4580cf3b_1430x840.png 424w, https://substackcdn.com/image/fetch/$s_!_uNJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f776-ef66-4a00-969b-e0df4580cf3b_1430x840.png 848w, https://substackcdn.com/image/fetch/$s_!_uNJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f776-ef66-4a00-969b-e0df4580cf3b_1430x840.png 1272w, https://substackcdn.com/image/fetch/$s_!_uNJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f776-ef66-4a00-969b-e0df4580cf3b_1430x840.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_uNJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f776-ef66-4a00-969b-e0df4580cf3b_1430x840.png" width="1430" height="840" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4219f776-ef66-4a00-969b-e0df4580cf3b_1430x840.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:840,&quot;width&quot;:1430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:153397,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_uNJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f776-ef66-4a00-969b-e0df4580cf3b_1430x840.png 424w, https://substackcdn.com/image/fetch/$s_!_uNJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f776-ef66-4a00-969b-e0df4580cf3b_1430x840.png 848w, https://substackcdn.com/image/fetch/$s_!_uNJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f776-ef66-4a00-969b-e0df4580cf3b_1430x840.png 1272w, https://substackcdn.com/image/fetch/$s_!_uNJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4219f776-ef66-4a00-969b-e0df4580cf3b_1430x840.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>DuckDB offers powerful features for working with nested JSON files, such as those found in the GitHub Archive dataset. One of the key advantages is the ability to use dot notation to access nested attributes directly, as well as functions like <code>unnest()</code> to fully unpack nested structures in your queries. </p><p>For example, if we wanted to flatten and extract all the attributes within the <code>actor</code> object, we could do so with a simple query like:</p><pre><code>query=f"SELECT UNNEST(actor),.... 
FROM '{raw_dataset}'"</code></pre><p>This approach makes it easy to work with complex, deeply nested data while maintaining simplicity in your queries.</p><p><strong>Data Export:</strong> After the data has been cleaned, the result is written to the Silver layer in Parquet format using DuckDB&#8217;s engine:</p><pre><code>gharchive_clean_result.write_parquet(sink_path)</code></pre><p></p><h3>Serialisation Performance</h3><p>When running the serialisation process close to the data, and using DuckDB to handle the transformation, the entire process completes in under a minute. </p><p>This efficiency makes DuckDB an excellent choice for lightweight, in-place data transformations, especially when working with local or cloud-based object storage systems like S3.</p><pre><code>2024-10-01 15:41:04,365 - INFO - DuckDB - collect source data files: s3://datalake-bronze/gharchive/events/2024-10-01/15/*
100% &#9621;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9615;
2024-10-01 15:41:29,892 - INFO - DuckDB - clean data
2024-10-01 15:41:30,129 - INFO - DuckDB - serialise and export cleaned data to s3://datalake-silver/gharchive/events/2024-10-01/15/clean_20241001_15.parquet
100% &#9621;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9615;</code></pre><p></p><h3>Key Takeaways</h3><p>The use&nbsp;of DuckDB for&nbsp;this&nbsp;transformation process is&nbsp;a key design choice:</p><ul><li><p><strong>In-Memory Processing:</strong> DuckDB allows for efficient in-memory&nbsp;processing of the data, which&nbsp;is particularly&nbsp;useful for the typically large GitHub Archive datasets.</p></li><li><p><strong>SQL Interface:</strong> The use of SQL for data modeling provides a familiar and powerful interface for data transformations.</p></li><li><p><strong>Parquet Writing:</strong> DuckDB has very efficient Parquet reader and writer for fast and efficient serialisation of data from primitive data types such as JSON and CSV, while eliminating the need for intermediate&nbsp;steps&nbsp;or&nbsp;additional libraries.</p><p></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.pracdata.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Practical Data Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>Data Aggregation   - &#129351; Gold Zone</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YOX2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308deeff-6014-40da-8daa-322b70063efa_1658x1141.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YOX2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308deeff-6014-40da-8daa-322b70063efa_1658x1141.png 424w, https://substackcdn.com/image/fetch/$s_!YOX2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308deeff-6014-40da-8daa-322b70063efa_1658x1141.png 848w, https://substackcdn.com/image/fetch/$s_!YOX2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308deeff-6014-40da-8daa-322b70063efa_1658x1141.png 1272w, https://substackcdn.com/image/fetch/$s_!YOX2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308deeff-6014-40da-8daa-322b70063efa_1658x1141.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YOX2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308deeff-6014-40da-8daa-322b70063efa_1658x1141.png" width="1456" height="1002" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/308deeff-6014-40da-8daa-322b70063efa_1658x1141.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1002,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:236184,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YOX2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308deeff-6014-40da-8daa-322b70063efa_1658x1141.png 424w, https://substackcdn.com/image/fetch/$s_!YOX2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308deeff-6014-40da-8daa-322b70063efa_1658x1141.png 848w, https://substackcdn.com/image/fetch/$s_!YOX2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308deeff-6014-40da-8daa-322b70063efa_1658x1141.png 1272w, https://substackcdn.com/image/fetch/$s_!YOX2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308deeff-6014-40da-8daa-322b70063efa_1658x1141.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Step 3 - Data Aggregation</figcaption></figure></div><p></p><p>After modeling and serialising our GitHub Archive raw data into the Silver zone, the next step in our data pipeline is to aggregate this data and publish it to the Gold zone on a daily basis. 
</p><p>Here&#8217;s an overview of the method responsible for performing the daily aggregation:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lSnj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d3400b-ead8-4bde-b232-3ae3362f3c20_1424x544.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lSnj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d3400b-ead8-4bde-b232-3ae3362f3c20_1424x544.png 424w, https://substackcdn.com/image/fetch/$s_!lSnj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d3400b-ead8-4bde-b232-3ae3362f3c20_1424x544.png 848w, https://substackcdn.com/image/fetch/$s_!lSnj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d3400b-ead8-4bde-b232-3ae3362f3c20_1424x544.png 1272w, https://substackcdn.com/image/fetch/$s_!lSnj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d3400b-ead8-4bde-b232-3ae3362f3c20_1424x544.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lSnj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d3400b-ead8-4bde-b232-3ae3362f3c20_1424x544.png" width="1424" height="544" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14d3400b-ead8-4bde-b232-3ae3362f3c20_1424x544.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:544,&quot;width&quot;:1424,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94135,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lSnj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d3400b-ead8-4bde-b232-3ae3362f3c20_1424x544.png 424w, https://substackcdn.com/image/fetch/$s_!lSnj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d3400b-ead8-4bde-b232-3ae3362f3c20_1424x544.png 848w, https://substackcdn.com/image/fetch/$s_!lSnj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d3400b-ead8-4bde-b232-3ae3362f3c20_1424x544.png 1272w, https://substackcdn.com/image/fetch/$s_!lSnj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d3400b-ead8-4bde-b232-3ae3362f3c20_1424x544.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This method encompasses several key steps:</p><p><strong>Source and Sink Configuration:</strong> It identifies the source (Silver) and sink (Gold) bucket names based on the configuration.</p><p><strong>Data Loading and Aggregation:</strong> The aggregation logic is defined by using SQL applied to a DuckDB virtual table that has been defined over Parquet files present in the Silver zone. 
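</p><p>A minimal sketch of this kind of aggregation with DuckDB&#8217;s Python API could look like the following. It is an illustration under stated assumptions rather than the project&#8217;s actual code: the table <code>silver_events</code>, its column names and the sample rows are hypothetical stand-ins for the Silver-zone Parquet files, which the real pipeline would read from object storage.</p><pre><code>import duckdb

# In-memory DuckDB database (the default when no file path is given).
con = duckdb.connect()

# Hypothetical stand-in for the Silver zone: three sample events.
# In the pipeline, the source would be the Parquet files in the Silver bucket.
con.sql("""
    CREATE TABLE silver_events AS
    SELECT * FROM (VALUES
        ('WatchEvent', 'duckdb/duckdb', TIMESTAMP '2024-10-01 01:15:00'),
        ('WatchEvent', 'duckdb/duckdb', TIMESTAMP '2024-10-01 09:30:00'),
        ('PullRequestEvent', 'apache/airflow', TIMESTAMP '2024-10-01 12:00:00')
    ) AS t(event_type, repo_name, created_at)
""")

# Count events per type, repository and day.
# GROUP BY ALL groups by every non-aggregated column in the SELECT list.
agg = con.sql("""
    SELECT
        event_type,
        repo_name,
        CAST(created_at AS DATE) AS event_date,
        count(*) AS event_count
    FROM silver_events
    GROUP BY ALL
    ORDER BY event_count DESC, event_type
""")

# Materialise the result as a list of tuples, one per group.
rows = agg.fetchall()
for row in rows:
    print(row)
</code></pre><p>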
This aggregation focuses on counting GitHub events by type (e.g., stars, pull requests), repository, and date, providing an aggregated view of GitHub activity within a daily time window.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XVEX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30cdbc5-c493-4861-b254-6156024a88b8_1430x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XVEX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30cdbc5-c493-4861-b254-6156024a88b8_1430x708.png 424w, https://substackcdn.com/image/fetch/$s_!XVEX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30cdbc5-c493-4861-b254-6156024a88b8_1430x708.png 848w, https://substackcdn.com/image/fetch/$s_!XVEX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30cdbc5-c493-4861-b254-6156024a88b8_1430x708.png 1272w, https://substackcdn.com/image/fetch/$s_!XVEX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30cdbc5-c493-4861-b254-6156024a88b8_1430x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XVEX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30cdbc5-c493-4861-b254-6156024a88b8_1430x708.png" width="1430" height="708" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c30cdbc5-c493-4861-b254-6156024a88b8_1430x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:708,&quot;width&quot;:1430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:127791,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XVEX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30cdbc5-c493-4861-b254-6156024a88b8_1430x708.png 424w, https://substackcdn.com/image/fetch/$s_!XVEX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30cdbc5-c493-4861-b254-6156024a88b8_1430x708.png 848w, https://substackcdn.com/image/fetch/$s_!XVEX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30cdbc5-c493-4861-b254-6156024a88b8_1430x708.png 1272w, https://substackcdn.com/image/fetch/$s_!XVEX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30cdbc5-c493-4861-b254-6156024a88b8_1430x708.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The <code>GROUP BY ALL</code> feature in DuckDB simplifies group by statements by removing the need to explicitly specify the columns.</p><p>As with the previous transformation step, we persist the result of this aggregation to an in-memory DuckDB table, returning it as a <code>DuckDBPyRelation</code> object to ensure that future calls do not re-execute the SQL logic.</p><p><strong>Data Export:</strong> The aggregated data is subsequently written to the Gold zone in Parquet format:</p><pre><code>gharchive_agg_result.write_parquet(sink_path)</code></pre><p></p><h3>Aggregation Performance</h3><p>The transformation pipeline on a cloud VM takes less than a minute to aggregate 24 Parquet files containing nearly 6 million records, and serialise the result into a Parquet file published in the Gold zone. 
</p><p>This efficiency underscores the capability of DuckDB to handle small to medium-scale data transformations quickly and effectively.</p><pre><code>2024-10-01 00:31:42,787 - INFO - DuckDB - aggregate silver data in s3://datalake-silver/gharchive/events/2024-10-01/*/*.parquet
100% &#9621;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9615;
2024-10-01 00:31:53,020 - INFO - DuckDB - export aggregated data to s3://datalake-gold/gharchive/events/2024-10-01/agg_20241001.parquet
100% &#9621;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9615;</code></pre><h3>Key Takeaways</h3><p>The use of&nbsp;DuckDB for this&nbsp;data aggregation process offers several advantages:</p><ul><li><p><strong>Efficient Processing:</strong> DuckDB's column-oriented storage and processing is well-suited for analytical&nbsp;queries and aggregations.</p></li><li><p><strong>SQL&nbsp;Interface:</strong> The use of&nbsp;SQL for aggregation in Python provides&nbsp;a familiar and powerful interface&nbsp;for complex&nbsp;data transformations.</p></li><li><p><strong>Efficient Parquet Integration:</strong> DuckDB's&nbsp;native&nbsp;support for Parquet&nbsp;files allows for efficient reading&nbsp;and writing of data&nbsp;in&nbsp;this format.</p></li></ul><p></p><h1>Orchestration &amp; Scheduling </h1><p>In a production environment, you would typically use a workflow orchestrator like Apache Airflow, Dagster or Prefect to manage the execution of the three pipelines discussed in this article.</p><p>However, since the goal here was to demonstrate how DuckDB can be used for data transformation, I&#8217;ve intentionally omitted any external orchestration or scheduling components. Instead, in the GitHub project, I&#8217;ve provided individual scripts for each pipeline along with instructions in the README on how to schedule them easily using cron.</p><p>That said, you can easily adapt the code to your preferred workflow orchestrator. 
For example, in Airflow, you could use a <code>PythonOperator</code> to call the functions for each pipeline step by importing the relevant class.</p><p>This code-first approach to pipeline development ensures that your business logic remains decoupled from workflow logic, making the pipelines flexible and easy to port to any Python-based orchestration tool.</p><h1>Interactive Data Analytics</h1><p>Once the data is prepared for analytics, DuckDB's efficient querying capabilities allow us to easily extract meaningful insights from the aggregated data, making it an excellent tool for interactive data analysis. </p><p>The following code snippet, from a Jupyter notebook on my laptop, demonstrates how to analyse the most-starred repositories on a specific day using the Parquet files stored in the Gold zone.</p><p>The code showcases DuckDB's Python API, highlighting its Pythonic data analysis capabilities, which are comparable to other DataFrame APIs like PySpark and Pandas.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LbNP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456fc22-d1bc-4489-b03a-9932cf31b6a5_782x684.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LbNP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456fc22-d1bc-4489-b03a-9932cf31b6a5_782x684.png 424w, https://substackcdn.com/image/fetch/$s_!LbNP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456fc22-d1bc-4489-b03a-9932cf31b6a5_782x684.png 848w, 
https://substackcdn.com/image/fetch/$s_!LbNP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456fc22-d1bc-4489-b03a-9932cf31b6a5_782x684.png 1272w, https://substackcdn.com/image/fetch/$s_!LbNP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456fc22-d1bc-4489-b03a-9932cf31b6a5_782x684.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LbNP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456fc22-d1bc-4489-b03a-9932cf31b6a5_782x684.png" width="782" height="684" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3456fc22-d1bc-4489-b03a-9932cf31b6a5_782x684.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:684,&quot;width&quot;:782,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:132548,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LbNP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456fc22-d1bc-4489-b03a-9932cf31b6a5_782x684.png 424w, https://substackcdn.com/image/fetch/$s_!LbNP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456fc22-d1bc-4489-b03a-9932cf31b6a5_782x684.png 848w, 
https://substackcdn.com/image/fetch/$s_!LbNP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456fc22-d1bc-4489-b03a-9932cf31b6a5_782x684.png 1272w, https://substackcdn.com/image/fetch/$s_!LbNP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456fc22-d1bc-4489-b03a-9932cf31b6a5_782x684.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h1>Conclusion</h1><p>In this article, we explored how DuckDB can be leveraged for efficient batch data transformation and 
serialisation in a data lake architecture. We walked through the steps of ingesting raw GitHub Archive data, transforming it into structured formats, and aggregating the data for analysis using the Medallion architecture (Bronze &#8594; Silver &#8594; Gold).</p><p>DuckDB&#8217;s in-memory processing, SQL-based transformation capabilities, and seamless Parquet integration make it an excellent choice for handling such datasets with high performance. </p><p>For those looking to replicate this approach, the full code examples are available on <a href="https://github.com/pracdata/duckdb-pipeline">GitHub</a>, allowing you to explore the potential of DuckDB in your own data transformation pipelines.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DS7n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" width="220" height="218.42293906810036" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:558,&quot;resizeWidth&quot;:220,&quot;bytes&quot;:136646,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/in/alirezasadeghi/&quot;,&quot;text&quot;:&quot;Follow Me on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/in/alirezasadeghi/"><span>Follow Me on LinkedIn</span></a></p>]]></content:encoded></item><item><title><![CDATA[DLD #3 | Data Landscape Digest 🗞️]]></title><description><![CDATA[Open Catalog War, Latest Apache Kafka and Apache Flink Releases, Airflow Trigger Rules, Lakehouse File Formats and More.]]></description><link>https://www.pracdata.io/p/dld-3-data-landscape-digest</link><guid isPermaLink="false">https://www.pracdata.io/p/dld-3-data-landscape-digest</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 22 Sep 2024 08:37:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uP6V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uP6V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!uP6V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uP6V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uP6V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uP6V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uP6V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg" width="1456" height="1020" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!uP6V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uP6V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uP6V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uP6V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h1> &#10024; Featured - The New Open Catalog War</h1><div><hr></div><p>In July this year, <strong>Snowflake</strong> <strong><a href="https://www.snowflake.com/en/blog/polaris-catalog-open-source/">open-sourced</a></strong> its <strong>Polaris Catalog</strong> under the Apache 2.0 license, with plans to submit it to the Apache Incubator program. Polaris is a catalog service designed for Apache Iceberg but can extend to other major open formats as well.</p><p><strong>The question is:</strong> why does <strong>Apache Iceberg,</strong> which already has its own metadata layer, need a separate catalog service? </p><p>While <strong>Delta Lake</strong>, <strong>Apache Hudi</strong>, and <strong>Apache Iceberg</strong> open table formats  provide their own metadata, each query engine (such as Spark, Flink, Presto or Trino) must perform separate integrations for tasks like schema discovery and data operations. </p><p>An <em><strong>open</strong></em> <em><strong>unified catalog service</strong></em> like Polaris simplifies this by streamlining multi-engine <em><strong>interoperability</strong></em>. It also offers enhanced features like improved search, data discovery, tagging, and governance, including access control through a <em><strong>unified</strong></em>, <em><strong>open</strong></em>, and <em><strong>vendor-agnostic</strong></em> interface compatible with various storage engines. </p><p>Polaris has the potential to become the standard catalog service for data lakehouse platforms, much like <strong>Hive Metastore</strong> was for Hadoop-based systems. 
Currently, catalog options include proprietary tools like AWS Glue Catalog and Databricks&#8217; Unity Catalog, which was also open-sourced in June 2024. </p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UyWz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UyWz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 424w, https://substackcdn.com/image/fetch/$s_!UyWz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 848w, https://substackcdn.com/image/fetch/$s_!UyWz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 1272w, https://substackcdn.com/image/fetch/$s_!UyWz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UyWz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png" width="1456" height="899" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:899,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:625076,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UyWz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 424w, https://substackcdn.com/image/fetch/$s_!UyWz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 848w, https://substackcdn.com/image/fetch/$s_!UyWz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 1272w, https://substackcdn.com/image/fetch/$s_!UyWz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Snowflake&#8217;s decision to open-source Polaris may have been influenced by Databricks' move to open source <strong>Unity Catalog</strong>. While some may see this as a new &#8220;<strong>catalog war</strong>&#8221; driven by marketing strategies, as noted by <strong><a href="https://materializedview.io/p/data-lakehouse-catalog-reality-check">Chris</a></strong>, there&#8217;s still hope that these moves will lead to production-ready, open-source catalog services that can finally provide an alternative solution to Hive Metastore after all these years.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.pracdata.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Practical Data Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><div><hr></div><h1>&#128225; Open Source News</h1><div><hr></div><h3>&#128073;   Seamless Integration of dbt and Airflow</h3><p>Building data pipelines from <strong>dbt</strong> models and scheduling them with workflow orchestration tools like <strong>Airflow</strong> has become standard practice in data engineering. In response to the growing demand for seamless integration between the two systems, <strong>Astronomer</strong> developed a Python package called <strong><a href="https://github.com/astronomer/astronomer-cosmos">Cosmos</a></strong>. This package simplifies running dbt models in Airflow by rendering a dbt project as an Airflow DAG or task group through classes such as <code>DbtDag</code> and <code>DbtTaskGroup</code>. The latest version, 1.5.1, was released on July 17, so if you're using dbt and Airflow, it's definitely worth checking out. <strong><a href="https://www.astronomer.io/blog/airflow-dbt-next-chapter/">&#8212;&gt; Read More</a></strong></p><h3>&#128073; What's New in Apache Kafka 3.8.0</h3><p>The release of <strong>Apache Kafka 3.8.0</strong> was recently announced, bringing several new features and improvements. This post on the official Apache Kafka site provides an overview of key updates, including support for compression levels, a new consumer rebalance protocol, and re-bootstrapping capabilities. 
<strong><a href="https://kafka.apache.org/blog#apache_kafka_380_release_announcement">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Apache Flink 1.20: New Features and the Road to Flink 2.0</h3><p><strong>Apache Flink 1.20</strong> was also recently released, and the Confluent blog highlights the major improvements and features in this update. Notable changes include improvements to bucketing for <strong>Flink SQL</strong> tables, which let users specify the number of buckets in the <code>DISTRIBUTED BY</code> clause, and the introduction of Flink SQL <strong>materialized tables</strong>, which are refreshed automatically in the background as data streams in. There are also various operational improvements. According to reports, this may be the last minor release before Flink 2.0. <strong><a href="https://www.confluent.io/blog/exploring-apache-flink-1-20-features-improvements-and-more/">&#8212;&gt; Read More</a></strong></p><div><hr></div><h1>&#128736; Practical Data Engineering</h1><div><hr></div><h3>&#128073;  Mastering Airflow Trigger Rules</h3><p><strong>Astronomer</strong>, a managed Airflow service provider, has published an overview of <strong>Airflow trigger rules</strong> with a visual guide to help new Airflow developers understand and apply the right trigger rules in their DAGs. For new engineers, grasping all the trigger rules can be challenging, but it is crucial for managing dependencies between upstream and downstream tasks. <strong><a href="https://www.astronomer.io/blog/understanding-airflow-trigger-rules-comprehensive-visual-guide/">&#8212;&gt; Read More</a></strong></p><h3>&#128073;  A Radical Simplicity Approach to Data Engineering</h3><p>The <strong>Towards Data Science</strong> (TDS) blog recently shared a great post about the trade-off between simplicity and functionality in software projects, including data engineering. 
The author advocates for a philosophy of "<em><strong>Radical Simplicity</strong></em>," where simple, straightforward solutions are prioritised over complex ones. This resonates with me, as I believe complexity should only be introduced when absolutely necessary.  <strong><a href="https://towardsdatascience.com/radical-simplicity-in-data-engineering-86ec3d2bd71c">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Hands-On Guide: Installing and Integrating Polaris OSS</h3><p>If you're interested in installing and testing the latest <strong>Apache Polaris</strong> open-source release, <strong>Dremio</strong> has published a hands-on tutorial that guides you through the installation process and integration with Apache Spark and Apache Iceberg. <strong><a href="https://www.dremio.com/blog/getting-hands-on-with-polaris-oss-apache-iceberg-and-apache-spark/">&#8212;&gt; Read More</a></strong></p><div><hr></div><h1>&#9881;&#65039; Technical Deep Dive</h1><div><hr></div><h3>&#128073;  Parquet vs ORC: Choosing the Right Format for Data Lakehouse</h3><p>In a recent Apache Hudi blog post, the author compares <strong>Parquet</strong> and <strong>ORC</strong>, two of the most popular columnar file formats for data lakes and open table formats. The post argues that Parquet delivers better performance for read-heavy, complex analytical use cases where query performance is crucial, while ORC offers a more balanced approach for both read and write performance with superior compression, making it a better fit for general-purpose data storage in Hudi. <strong><a href="https://hudi.apache.org/blog/2024/07/31/hudi-file-formats/">&#8212;&gt; Read More</a></strong></p><h3>&#128073;  Overview of Amazon MSK Tiered Storage</h3><p><strong>Amazon</strong> recently published a post explaining how the new Kafka tiered storage in <strong>Amazon MSK</strong> (Managed Streaming for Apache Kafka) enhances scalability and resiliency. 
With the new decoupled storage and compute architecture, the system benefits from faster broker recovery, improved load balancing, and virtually unlimited scalability.  <strong><a href="https://aws.amazon.com/blogs/big-data/improve-apache-kafka-scalability-and-resiliency-using-amazon-msk-tiered-storage/">&#8212;&gt; Read More</a></strong></p><div><hr></div><h1> &#128172; Community Discussions</h1><div><hr></div><p>Several widely discussed threads on Reddit last month (<strong><a href="https://www.reddit.com/r/dataengineering/comments/1efb5xr/this_market_is_seriously_wack/">thread 1</a></strong>, <strong><a href="https://www.reddit.com/r/dataengineering/comments/1edezgh/a_data_engineer_doing_power_bi_stuff/">thread 2</a></strong>) covered data engineering roles and the challenges of applying for data engineering jobs. Based on comments from both candidates and hiring managers (including CTOs), the problem appears to be twofold. </p><p>On one hand, the market is flooded with unqualified candidates&#8212;often with minimal skills from low-quality bootcamps and online courses&#8212;making it hard for companies to find qualified engineers. On the other hand, there are vague and unclear job descriptions, leaving data engineers unsure of what is expected of them once hired. This has created frustration on both sides.</p><div><hr></div><h1> &#128270; Case Studies</h1><div><hr></div><h3>&#128073;   Evolution of Apache Flink Architecture at Airbnb</h3><p><strong>Airbnb</strong> published a post detailing the evolution of their <strong>Apache Flink architecture</strong>. In 2018, they initially deployed Flink jobs on Hadoop YARN with Airflow as the scheduler. Today, they&#8217;ve moved to deploying Flink jobs on <strong>Kubernetes</strong>, eliminating the need for a job scheduler.  
<strong><a href="https://medium.com/airbnb-engineering/apache-flink-on-kubernetes-84425d66ee11">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Pinterest's Adoption of StarRocks for Real-Time Analytics</h3><p><strong>Pinterest</strong> shared their recent adoption of <strong>StarRocks</strong>, a real-time OLAP engine, for their real-time analytics platform. They chose StarRocks for features such as standard SQL support, joins, subqueries, and materialized views, capabilities not readily available in other real-time OLAP engines like Druid. Back in 2021, Pinterest <strong><a href="https://medium.com/pinterest-engineering/pinterests-analytics-as-a-platform-on-druid-part-1-of-3-9043776b7b76">published</a></strong> details about managing a large Druid fleet with 2,000 nodes in a multi-cluster setup. <strong><a href="https://medium.com/pinterest-engineering/delivering-faster-analytics-at-pinterest-a639cdfad374">&#8212;&gt; Read More</a></strong></p><h3>&#128073; The Rise of New Real-time OLAP Engines</h3><p>On that note, while Apache Druid, Pinot, and ClickHouse have dominated the open-source real-time OLAP space in recent years, we&#8217;re now seeing increased adoption of newer engines like <strong>Apache Doris</strong> and <strong>StarRocks</strong>, the latter being a fork of Doris. For a detailed comparison between StarRocks and Doris, check out this <strong><a href="https://medium.com/starrocks-engineering/detailed-comparison-between-starrocks-and-apache-doris-81ddd34be527">blog post</a></strong> from StarRocks Engineering.</p><h3>&#128073; Uber&#8217;s Hadoop-to-Cloud Migration</h3><p><strong>Uber</strong>, which operates one of the largest on-premises <strong>Hadoop</strong> clusters, has recently begun migrating to the cloud, starting with a key architectural shift: replacing HDFS with Google Cloud Storage, while still running the rest of their stack on IaaS. 
One of the challenges in migrating from Hadoop to the cloud is transitioning Hadoop&#8217;s security features, such as delegation tokens and Kerberos authentication, to Google Cloud&#8217;s token-based security. Uber discusses how they tackled these security migration challenges in this article. <strong><a href="https://www.uber.com/en-AU/blog/securing-hadoop-on-gcp/">&#8212;&gt; Read More</a></strong></p><div><hr></div><h1> &#128227; Vendor News &amp; Announcements</h1><div><hr></div><h3>&#128073; Fivetran Integration with Snowflake&#8217;s Polaris Catalog</h3><p>Just days after Snowflake open-sourced the Polaris catalog service on GitHub, <strong>Fivetran</strong>, a leading SaaS provider for data integration, announced an upcoming integration with the <strong>Polaris</strong> data catalog. The integration aims to provide a managed catalog solution for Fivetran&#8217;s <strong>Managed Data Lake Service</strong>. <strong><a href="https://www.fivetran.com/blog/unlock-catalog-interoperability-with-fivetran-and-polaris">&#8212;&gt; Read More</a></strong></p><h3>&#128073;  Databricks LakeFlow Connect for Automated Data Ingestion</h3><p><strong>Databricks</strong> announced the public preview of <strong><a href="https://www.databricks.com/product/data-ingestion">LakeFlow Connect</a></strong>, an automated incremental data ingestion service for sources like SQL Server and Salesforce. Built on <strong>Delta Live Tables</strong>, LakeFlow Connect enables incremental data ingestion using CDC (Change Data Capture). This marks another step by major vendors toward automating data engineering tasks. 
<strong><a href="https://www.databricks.com/blog/ingest-data-sql-server-salesforce-and-workday-lakeflow-connect">&#8212;&gt; Read More</a></strong></p><h3>&#128073;  Databricks Lakehouse Federation Across AWS, Azure, and GCP</h3><p><strong>Databricks</strong> also announced the general availability of <strong>Lakehouse Federation</strong> in Unity Catalog across AWS, Azure, and GCP last month. This mirrors the strategy of other top cloud vendors to offer a unified analytical platform with centralised data discovery and governance, providing a unified view of enterprise data across multiple storage engines and cloud platforms. <strong><a href="https://www.databricks.com/blog/announcing-general-availability-lakehouse-federation">&#8212;&gt; Read More</a></strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TYor!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TYor!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 424w, https://substackcdn.com/image/fetch/$s_!TYor!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 848w, https://substackcdn.com/image/fetch/$s_!TYor!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TYor!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TYor!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png" width="1456" height="1549" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1549,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209085,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TYor!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 424w, https://substackcdn.com/image/fetch/$s_!TYor!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 848w, https://substackcdn.com/image/fetch/$s_!TYor!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TYor!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3>&#128073;  ClickHouse Acquisition of PeerDB for Real-Time Postgres Ingestion</h3><p><strong>ClickHouse, Inc.</strong> announced the acquisition of <strong><a href="https://peerdb.io/">PeerDB</a></strong>, a provider of Change Data Capture (CDC) for <strong>Postgres</strong> databases. 
This move aims to integrate and streamline real-time data ingestion from transactional databases like Postgres into the ClickHouse OLAP engine. <strong><a href="https://clickhouse.com/blog/clickhouse-welcomes-peerdb-adding-the-fastest-postgres-cdc-to-the-fastest-olap-database">&#8212;&gt; Read More</a></strong></p><div><hr></div><h1> &#127909; Conferences &amp; Events</h1><div><hr></div><h3>&#128073; Kafka Current 2024 Keynotes</h3><p>The two-day <strong>Kafka Current 2024</strong> event, organised by Confluent, took place last week in Austin. Keynotes from both Day 1 and Day 2 have already been published on YouTube: <strong><a href="https://www.youtube.com/watch?v=Sn6fVsOzrSU">Keynote Day 1</a> | <a href="https://www.youtube.com/watch?v=ccupkhcLioM">Keynote Day 2</a></strong></p><p></p><h3>&#128073;  Open Source Data Summit Virtual Conference</h3><p>The <strong>Open Source Data Summit</strong> Virtual Conference will be held on October 2nd. If you're interested, you can register for free at <strong><a href="https://opensourcedatasummit.com/">opensourcedatasummit.com</a></strong>. The event will feature numerous discussions on data lakehouses and the role of open table formats in modern data architectures.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.pracdata.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Practical Data Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[DuckDB Beyond the Hype]]></title><description><![CDATA[A Powerful Addition to the Data Scientist's and Data Engineer's Toolbox]]></description><link>https://www.pracdata.io/p/duckdb-beyond-the-hype</link><guid isPermaLink="false">https://www.pracdata.io/p/duckdb-beyond-the-hype</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sat, 14 Sep 2024 21:00:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697ec10-038a-4278-90cd-607ef890d50c_1472x953.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uoN5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6baa08d2-cce1-4cd3-ba5b-0358af0ff3ee_1676x952.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uoN5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6baa08d2-cce1-4cd3-ba5b-0358af0ff3ee_1676x952.png 424w, https://substackcdn.com/image/fetch/$s_!uoN5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6baa08d2-cce1-4cd3-ba5b-0358af0ff3ee_1676x952.png 848w, 
https://substackcdn.com/image/fetch/$s_!uoN5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6baa08d2-cce1-4cd3-ba5b-0358af0ff3ee_1676x952.png 1272w, https://substackcdn.com/image/fetch/$s_!uoN5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6baa08d2-cce1-4cd3-ba5b-0358af0ff3ee_1676x952.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uoN5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6baa08d2-cce1-4cd3-ba5b-0358af0ff3ee_1676x952.png" width="1456" height="827" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6baa08d2-cce1-4cd3-ba5b-0358af0ff3ee_1676x952.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:827,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117753,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uoN5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6baa08d2-cce1-4cd3-ba5b-0358af0ff3ee_1676x952.png 424w, https://substackcdn.com/image/fetch/$s_!uoN5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6baa08d2-cce1-4cd3-ba5b-0358af0ff3ee_1676x952.png 848w, 
https://substackcdn.com/image/fetch/$s_!uoN5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6baa08d2-cce1-4cd3-ba5b-0358af0ff3ee_1676x952.png 1272w, https://substackcdn.com/image/fetch/$s_!uoN5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6baa08d2-cce1-4cd3-ba5b-0358af0ff3ee_1676x952.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>After years of working in the data space, you witness the rise and fall of numerous tools and products. 
Over time, you become more skeptical of the latest shiny tools, often dismissing them as marketing-driven hype, until they can no longer be ignored.</p><p><strong>DuckDB</strong> was one such tool for me. Initially, I dismissed it, but eventually the buzz around it became too loud to ignore. That's when curiosity got the better of me, and I decided to dive in.</p><p>But as with some new technologies, I initially struggled to grasp its core functionality. The <strong><a href="https://duckdb.org/">official website</a></strong> describes it as a "<em><strong>fast in-process analytical database</strong></em>", but this wasn't very enlightening.</p><p>Then I came across the phrase &#8220;<em><strong>SQLite for OLAP</strong></em>&#8221;, which helped me understand it better, though I was still unsure about its place in my mental data stack.</p><p>It wasn&#8217;t until I explored discussions in data communities like Reddit that I realised DuckDB, like many open-source projects, had evolved beyond its creators' original vision. That&#8217;s when I knew it was time to test it out for myself and decide if the hype was justified.</p><p></p><h1>WHAT DUCKDB IS</h1><div><hr></div><p>From what I found, DuckDB is a hybrid engine with a diverse range of capabilities. Let's explore these features in more detail.</p><h2>An Embeddable &amp; Portable Database</h2><p>DuckDB is an "<em><strong>embeddable</strong></em>" database system. This is how its creators described it in a <strong><a href="https://duckdb.org/pdf/SIGMOD2019-demo-duckdb.pdf">SIGMOD conference paper</a></strong> back in 2019.</p><p>Like <strong>SQLite</strong>, it allows you to store your data in a single <em><strong>.duckdb</strong></em> database file, making it easily portable across projects.</p><p>Besides being embeddable, DuckDB operates much like a traditional DBMS, but without requiring a long-running server process as MySQL or PostgreSQL do. 
There&#8217;s no need to start a database server and establish a socket connection to it over a hostname or IP address and port.</p><p>All you need to do is point to where the single DuckDB database file is stored, or start an in-memory session that needs no database file at all.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-tFz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7357c2c2-54bf-4ba5-bdd2-429a05c176fb_1442x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-tFz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7357c2c2-54bf-4ba5-bdd2-429a05c176fb_1442x340.png 424w, https://substackcdn.com/image/fetch/$s_!-tFz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7357c2c2-54bf-4ba5-bdd2-429a05c176fb_1442x340.png 848w, https://substackcdn.com/image/fetch/$s_!-tFz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7357c2c2-54bf-4ba5-bdd2-429a05c176fb_1442x340.png 1272w, https://substackcdn.com/image/fetch/$s_!-tFz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7357c2c2-54bf-4ba5-bdd2-429a05c176fb_1442x340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-tFz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7357c2c2-54bf-4ba5-bdd2-429a05c176fb_1442x340.png" width="1442" height="340" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7357c2c2-54bf-4ba5-bdd2-429a05c176fb_1442x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:340,&quot;width&quot;:1442,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53659,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-tFz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7357c2c2-54bf-4ba5-bdd2-429a05c176fb_1442x340.png 424w, https://substackcdn.com/image/fetch/$s_!-tFz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7357c2c2-54bf-4ba5-bdd2-429a05c176fb_1442x340.png 848w, https://substackcdn.com/image/fetch/$s_!-tFz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7357c2c2-54bf-4ba5-bdd2-429a05c176fb_1442x340.png 1272w, https://substackcdn.com/image/fetch/$s_!-tFz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7357c2c2-54bf-4ba5-bdd2-429a05c176fb_1442x340.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><h2>A Columnar OLAP Database</h2><p>DuckDB is a columnar database, making it highly efficient for running analytical queries. 
</p><p>It supports two main storage options: its native <em><strong>.duckdb</strong></em> format and open-standard file formats like Parquet, which DuckDB reads and writes efficiently, including support for predicate pushdown.</p><p>Internally, DuckDB uses a row-columnar structure. Data is sliced into <em>row-groups</em> of roughly 120,000 records (122,880 by default), and within each group, columns are stored separately and compressed, similar to popular binary formats like Parquet and ORC.</p><p>This architecture allows you to efficiently store and query analytical datasets on a local machine or a single server without the overhead of a full-fledged OLAP database engine.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hckv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697ec10-038a-4278-90cd-607ef890d50c_1472x953.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hckv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697ec10-038a-4278-90cd-607ef890d50c_1472x953.png 424w, https://substackcdn.com/image/fetch/$s_!hckv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697ec10-038a-4278-90cd-607ef890d50c_1472x953.png 848w, https://substackcdn.com/image/fetch/$s_!hckv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697ec10-038a-4278-90cd-607ef890d50c_1472x953.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hckv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697ec10-038a-4278-90cd-607ef890d50c_1472x953.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hckv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697ec10-038a-4278-90cd-607ef890d50c_1472x953.png" width="1456" height="943" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1697ec10-038a-4278-90cd-607ef890d50c_1472x953.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:943,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:288601,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hckv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697ec10-038a-4278-90cd-607ef890d50c_1472x953.png 424w, https://substackcdn.com/image/fetch/$s_!hckv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697ec10-038a-4278-90cd-607ef890d50c_1472x953.png 848w, https://substackcdn.com/image/fetch/$s_!hckv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697ec10-038a-4278-90cd-607ef890d50c_1472x953.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hckv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697ec10-038a-4278-90cd-607ef890d50c_1472x953.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">DuckDB&#8217;s internal data structure</figcaption></figure></div><p></p><h2>Interoperable SQL-Powered DataFrame</h2><p>DuckDB&#8217;s Python library is essentially a DataFrame on steroids. 
It integrates seamlessly with popular DataFrame libraries like <strong>Pandas</strong> and <strong>Polars</strong>, allowing efficient in-memory operations.</p><p>What sets DuckDB apart is its ability to run SQL queries directly on Python DataFrames. You can query <strong>Pandas</strong> and <strong>Polars</strong> DataFrames and <strong>Apache</strong> <strong>Arrow</strong> tables as though they were SQL tables.</p><p>For some frameworks, such as Apache Arrow, DuckDB uses <strong><a href="https://duckdb.org/2021/12/03/duck-arrow.html">zero-copy mode</a></strong> for fast conversion. In zero-copy mode, no serialisation is required to translate in-memory objects between representations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H8HN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a65eda8-c031-4e25-a352-aeed2f27b3f1_2022x1148.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H8HN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a65eda8-c031-4e25-a352-aeed2f27b3f1_2022x1148.png 424w, https://substackcdn.com/image/fetch/$s_!H8HN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a65eda8-c031-4e25-a352-aeed2f27b3f1_2022x1148.png 848w, https://substackcdn.com/image/fetch/$s_!H8HN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a65eda8-c031-4e25-a352-aeed2f27b3f1_2022x1148.png 1272w, 
https://substackcdn.com/image/fetch/$s_!H8HN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a65eda8-c031-4e25-a352-aeed2f27b3f1_2022x1148.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H8HN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a65eda8-c031-4e25-a352-aeed2f27b3f1_2022x1148.png" width="1456" height="827" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a65eda8-c031-4e25-a352-aeed2f27b3f1_2022x1148.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:827,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:327043,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H8HN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a65eda8-c031-4e25-a352-aeed2f27b3f1_2022x1148.png 424w, https://substackcdn.com/image/fetch/$s_!H8HN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a65eda8-c031-4e25-a352-aeed2f27b3f1_2022x1148.png 848w, https://substackcdn.com/image/fetch/$s_!H8HN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a65eda8-c031-4e25-a352-aeed2f27b3f1_2022x1148.png 1272w, 
https://substackcdn.com/image/fetch/$s_!H8HN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a65eda8-c031-4e25-a352-aeed2f27b3f1_2022x1148.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">DuckDB&#8217;s interoperability between different DataFrame APIs</figcaption></figure></div><p></p><p>You can join data from different objects, such as a Polars DataFrame and a Pandas DataFrame, in a single SQL query. 
The results can be stored back into a DuckDB database or exported to file formats like Parquet on external storage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AhK1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe20aafa-2faa-48ec-8d4b-1d360acf81e3_1836x632.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AhK1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe20aafa-2faa-48ec-8d4b-1d360acf81e3_1836x632.png 424w, https://substackcdn.com/image/fetch/$s_!AhK1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe20aafa-2faa-48ec-8d4b-1d360acf81e3_1836x632.png 848w, https://substackcdn.com/image/fetch/$s_!AhK1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe20aafa-2faa-48ec-8d4b-1d360acf81e3_1836x632.png 1272w, https://substackcdn.com/image/fetch/$s_!AhK1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe20aafa-2faa-48ec-8d4b-1d360acf81e3_1836x632.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AhK1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe20aafa-2faa-48ec-8d4b-1d360acf81e3_1836x632.png" width="1456" height="501" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe20aafa-2faa-48ec-8d4b-1d360acf81e3_1836x632.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:501,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120783,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AhK1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe20aafa-2faa-48ec-8d4b-1d360acf81e3_1836x632.png 424w, https://substackcdn.com/image/fetch/$s_!AhK1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe20aafa-2faa-48ec-8d4b-1d360acf81e3_1836x632.png 848w, https://substackcdn.com/image/fetch/$s_!AhK1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe20aafa-2faa-48ec-8d4b-1d360acf81e3_1836x632.png 1272w, https://substackcdn.com/image/fetch/$s_!AhK1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe20aafa-2faa-48ec-8d4b-1d360acf81e3_1836x632.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h2>A Federated Query Engine</h2><p>DuckDB offers a simple, efficient way to query external data systems through its <strong><a href="https://duckdb.org/docs/extensions/overview">extensions</a></strong>. 
Like distributed query engines such as <strong>Athena</strong>, <strong>Presto</strong> or <strong>Trino</strong>, it lets you join data across various external sources.</p><p>You can directly query database systems like MySQL and Postgres; data files such as JSON, CSV, and Parquet stored in cloud storage systems like Amazon S3; and modern open table formats like <strong>Apache</strong> <strong>Iceberg</strong> and <strong>Delta Lake</strong>.</p><p>Though DuckDB doesn't have the concept of <em><strong>external tables</strong></em> found in systems like Hive or Redshift, you can still create persistent <strong>Views</strong> over external tables or data files, which then behave like <em>read-only external tables</em>.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vohW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde47df3c-568a-4900-bf90-34fa5f635bb0_2219x1424.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vohW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde47df3c-568a-4900-bf90-34fa5f635bb0_2219x1424.png 424w, https://substackcdn.com/image/fetch/$s_!vohW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde47df3c-568a-4900-bf90-34fa5f635bb0_2219x1424.png 848w, https://substackcdn.com/image/fetch/$s_!vohW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde47df3c-568a-4900-bf90-34fa5f635bb0_2219x1424.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vohW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde47df3c-568a-4900-bf90-34fa5f635bb0_2219x1424.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vohW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde47df3c-568a-4900-bf90-34fa5f635bb0_2219x1424.png" width="1456" height="934" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de47df3c-568a-4900-bf90-34fa5f635bb0_2219x1424.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:934,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:254982,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vohW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde47df3c-568a-4900-bf90-34fa5f635bb0_2219x1424.png 424w, https://substackcdn.com/image/fetch/$s_!vohW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde47df3c-568a-4900-bf90-34fa5f635bb0_2219x1424.png 848w, https://substackcdn.com/image/fetch/$s_!vohW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde47df3c-568a-4900-bf90-34fa5f635bb0_2219x1424.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vohW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde47df3c-568a-4900-bf90-34fa5f635bb0_2219x1424.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">DuckDB&#8217;s federated query capabilities</figcaption></figure></div><p></p><p>Querying a remote file in cloud storage can be done without downloading the entire file. 
DuckDB can efficiently sample and inspect the data, and for formats like Parquet it reads only the metadata and the byte ranges the query needs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IWRt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b6c5b4a-55cb-4443-afbc-9a5bbc7e7914_1830x822.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IWRt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b6c5b4a-55cb-4443-afbc-9a5bbc7e7914_1830x822.png 424w, https://substackcdn.com/image/fetch/$s_!IWRt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b6c5b4a-55cb-4443-afbc-9a5bbc7e7914_1830x822.png 848w, https://substackcdn.com/image/fetch/$s_!IWRt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b6c5b4a-55cb-4443-afbc-9a5bbc7e7914_1830x822.png 1272w, https://substackcdn.com/image/fetch/$s_!IWRt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b6c5b4a-55cb-4443-afbc-9a5bbc7e7914_1830x822.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IWRt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b6c5b4a-55cb-4443-afbc-9a5bbc7e7914_1830x822.png" width="1456" height="654" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b6c5b4a-55cb-4443-afbc-9a5bbc7e7914_1830x822.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:654,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:178289,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IWRt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b6c5b4a-55cb-4443-afbc-9a5bbc7e7914_1830x822.png 424w, https://substackcdn.com/image/fetch/$s_!IWRt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b6c5b4a-55cb-4443-afbc-9a5bbc7e7914_1830x822.png 848w, https://substackcdn.com/image/fetch/$s_!IWRt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b6c5b4a-55cb-4443-afbc-9a5bbc7e7914_1830x822.png 1272w, https://substackcdn.com/image/fetch/$s_!IWRt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b6c5b4a-55cb-4443-afbc-9a5bbc7e7914_1830x822.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h2>A Single-Node Compute Engine</h2><p>DuckDB can also act as a <strong>single-node compute engine</strong>, performing ephemeral batch transformations. 
It&#8217;s like having a stand-alone Spark at your disposal for smaller-scale workloads.</p><p>This is particularly useful in data lake architectures, where DuckDB can efficiently serialise raw data (e.g., JSON or CSV) into optimised formats like Parquet, and then transform or aggregate that data.</p><p>I tested this by implementing a simple data lake following the <a href="https://www.databricks.com/glossary/medallion-architecture">medallion architecture</a>, using GitHub event data dumps (known as <strong><a href="https://github.com/igrigorik/gharchive.org">GH Archive</a></strong>) as the data source.</p><p>I set up an hourly ingestion pipeline to collect archives from the <em><a href="http://www.gharchive.org/">gharchive.org</a></em> server and load them into my data lake&#8217;s Raw zone on S3. </p><p>I then created an hourly transformation pipeline using DuckDB to unpack, clean and serialise the hourly JSON dumps into Parquet files published to my Silver (clean) zone. </p><p>Finally, I used DuckDB again to run a daily job that aggregates the GitHub events by <em>event type</em>, <em>repository</em>, and <em>event date</em>, and exports the result to the Gold (analytics) zone. </p><p>The following code sample demonstrates the aggregation logic. 
On my laptop, DuckDB reads and processes 24 compressed Parquet files from S3, containing a total of approximately 5 million records, and exports the results back to the cloud in about 5 minutes.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T1e6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdb0b11b-717e-48bb-8133-acaf2d0cbe10_1834x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T1e6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdb0b11b-717e-48bb-8133-acaf2d0cbe10_1834x340.png 424w, https://substackcdn.com/image/fetch/$s_!T1e6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdb0b11b-717e-48bb-8133-acaf2d0cbe10_1834x340.png 848w, https://substackcdn.com/image/fetch/$s_!T1e6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdb0b11b-717e-48bb-8133-acaf2d0cbe10_1834x340.png 1272w, https://substackcdn.com/image/fetch/$s_!T1e6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdb0b11b-717e-48bb-8133-acaf2d0cbe10_1834x340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T1e6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdb0b11b-717e-48bb-8133-acaf2d0cbe10_1834x340.png" width="1456" height="270" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdb0b11b-717e-48bb-8133-acaf2d0cbe10_1834x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:270,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100204,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T1e6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdb0b11b-717e-48bb-8133-acaf2d0cbe10_1834x340.png 424w, https://substackcdn.com/image/fetch/$s_!T1e6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdb0b11b-717e-48bb-8133-acaf2d0cbe10_1834x340.png 848w, https://substackcdn.com/image/fetch/$s_!T1e6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdb0b11b-717e-48bb-8133-acaf2d0cbe10_1834x340.png 1272w, https://substackcdn.com/image/fetch/$s_!T1e6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdb0b11b-717e-48bb-8133-acaf2d0cbe10_1834x340.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The result was a simple and efficient data pipeline that produced analytics-ready datasets in Parquet format, which could be queried from my local machine using DuckDB&#8217;s SQL or Python API.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Pl0L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6c39f1-f662-4b16-b01a-51d6300b2abd_2067x1140.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pl0L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6c39f1-f662-4b16-b01a-51d6300b2abd_2067x1140.png 424w, https://substackcdn.com/image/fetch/$s_!Pl0L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6c39f1-f662-4b16-b01a-51d6300b2abd_2067x1140.png 848w, https://substackcdn.com/image/fetch/$s_!Pl0L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6c39f1-f662-4b16-b01a-51d6300b2abd_2067x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!Pl0L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6c39f1-f662-4b16-b01a-51d6300b2abd_2067x1140.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pl0L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6c39f1-f662-4b16-b01a-51d6300b2abd_2067x1140.png" width="1456" height="803" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a6c39f1-f662-4b16-b01a-51d6300b2abd_2067x1140.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:803,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:304920,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pl0L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6c39f1-f662-4b16-b01a-51d6300b2abd_2067x1140.png 424w, https://substackcdn.com/image/fetch/$s_!Pl0L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6c39f1-f662-4b16-b01a-51d6300b2abd_2067x1140.png 848w, https://substackcdn.com/image/fetch/$s_!Pl0L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6c39f1-f662-4b16-b01a-51d6300b2abd_2067x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!Pl0L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a6c39f1-f662-4b16-b01a-51d6300b2abd_2067x1140.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>The following shows the result of the query to get the top GitHub repositories receiving the most stars on 2024-08-27, using DuckDB's Python API on my laptop:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NRcn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a0d3554-130f-4c53-b2bf-b4e5ce613362_1830x1026.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NRcn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a0d3554-130f-4c53-b2bf-b4e5ce613362_1830x1026.png 424w, 
https://substackcdn.com/image/fetch/$s_!NRcn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a0d3554-130f-4c53-b2bf-b4e5ce613362_1830x1026.png 848w, https://substackcdn.com/image/fetch/$s_!NRcn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a0d3554-130f-4c53-b2bf-b4e5ce613362_1830x1026.png 1272w, https://substackcdn.com/image/fetch/$s_!NRcn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a0d3554-130f-4c53-b2bf-b4e5ce613362_1830x1026.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NRcn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a0d3554-130f-4c53-b2bf-b4e5ce613362_1830x1026.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a0d3554-130f-4c53-b2bf-b4e5ce613362_1830x1026.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:255622,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NRcn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a0d3554-130f-4c53-b2bf-b4e5ce613362_1830x1026.png 424w, 
https://substackcdn.com/image/fetch/$s_!NRcn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a0d3554-130f-4c53-b2bf-b4e5ce613362_1830x1026.png 848w, https://substackcdn.com/image/fetch/$s_!NRcn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a0d3554-130f-4c53-b2bf-b4e5ce613362_1830x1026.png 1272w, https://substackcdn.com/image/fetch/$s_!NRcn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a0d3554-130f-4c53-b2bf-b4e5ce613362_1830x1026.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>DuckDB might not be able to handle terabyte-scale data the way distributed engines can, but many use cases don&#8217;t require such scale. Besides, modern servers are powerful enough to handle a lot of the heavy lifting without needing distributed processing frameworks.</p><p></p><p>Given the above capabilities, what would you call a system which is:</p><blockquote><p><strong>An embeddable and portable DBMS, a columnar OLAP database, a SQL-based interoperable DataFrame, a federated query engine, and a single-node compute engine?</strong></p></blockquote><p>Let's call it <strong>DuckDB</strong>!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E84f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa25d3be5-f2af-4037-8463-c0e7bc8131eb_2085x1299.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E84f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa25d3be5-f2af-4037-8463-c0e7bc8131eb_2085x1299.png 424w, https://substackcdn.com/image/fetch/$s_!E84f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa25d3be5-f2af-4037-8463-c0e7bc8131eb_2085x1299.png 848w, https://substackcdn.com/image/fetch/$s_!E84f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa25d3be5-f2af-4037-8463-c0e7bc8131eb_2085x1299.png 1272w, 
https://substackcdn.com/image/fetch/$s_!E84f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa25d3be5-f2af-4037-8463-c0e7bc8131eb_2085x1299.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E84f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa25d3be5-f2af-4037-8463-c0e7bc8131eb_2085x1299.png" width="1456" height="907" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a25d3be5-f2af-4037-8463-c0e7bc8131eb_2085x1299.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:907,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:594305,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E84f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa25d3be5-f2af-4037-8463-c0e7bc8131eb_2085x1299.png 424w, https://substackcdn.com/image/fetch/$s_!E84f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa25d3be5-f2af-4037-8463-c0e7bc8131eb_2085x1299.png 848w, https://substackcdn.com/image/fetch/$s_!E84f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa25d3be5-f2af-4037-8463-c0e7bc8131eb_2085x1299.png 1272w, 
https://substackcdn.com/image/fetch/$s_!E84f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa25d3be5-f2af-4037-8463-c0e7bc8131eb_2085x1299.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.pracdata.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading 
Practical Data Engineering! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>WHAT DUCKDB CAN BE</h1><div><hr></div><p>Now that we've explored what DuckDB is, let's dive into its future potential, examining how it can evolve within the broader data engineering and data science ecosystem.</p><h2>The Missing Piece in the Data Scientist's Toolbox</h2><p>DuckDB fills a crucial gap in the data scientist&#8217;s toolbox: a powerful in-process database engine that integrates seamlessly into data science workflows without the need for complicated setups, installations, or data transfers between Python and external databases. </p><p><a href="https://open.spotify.com/episode/7mp1QYRJR5Q8Hg9gHACVGG">According to Mark Raasveldt</a>, one of DuckDB's creators, this was one of the primary design goals for building it.</p><p>Data scientists often find the process of setting up and using external databases cumbersome. They prefer working with plain-text or binary data files (like CSV or Parquet) because such files are easy to access and manipulate immediately. 
</p><p>However, DuckDB changes this dynamic by enabling data scientists to work with a relational SQL database directly within their Python environment, eliminating the need for external database installations and the associated data transfer overhead.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y2HR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb1902ab-0737-47e3-b68a-b7f10e471e82_1229x1056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y2HR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb1902ab-0737-47e3-b68a-b7f10e471e82_1229x1056.png 424w, https://substackcdn.com/image/fetch/$s_!y2HR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb1902ab-0737-47e3-b68a-b7f10e471e82_1229x1056.png 848w, https://substackcdn.com/image/fetch/$s_!y2HR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb1902ab-0737-47e3-b68a-b7f10e471e82_1229x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!y2HR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb1902ab-0737-47e3-b68a-b7f10e471e82_1229x1056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y2HR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb1902ab-0737-47e3-b68a-b7f10e471e82_1229x1056.png" width="1229" height="1056" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb1902ab-0737-47e3-b68a-b7f10e471e82_1229x1056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1056,&quot;width&quot;:1229,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:235761,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!y2HR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb1902ab-0737-47e3-b68a-b7f10e471e82_1229x1056.png 424w, https://substackcdn.com/image/fetch/$s_!y2HR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb1902ab-0737-47e3-b68a-b7f10e471e82_1229x1056.png 848w, https://substackcdn.com/image/fetch/$s_!y2HR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb1902ab-0737-47e3-b68a-b7f10e471e82_1229x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!y2HR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb1902ab-0737-47e3-b68a-b7f10e471e82_1229x1056.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Bring Your Own Compute</h2><p>As personal computers and laptops become more powerful, DuckDB enables a "<em><strong>bring-your-own-compute</strong></em>" model. It allows users to leverage their local machines to run analytics on shared data, whether that data is stored locally or in the cloud. </p><p>Users can easily attach to data files or external database systems hosted in the cloud and perform their analyses using DuckDB's local processing power. This is especially beneficial for collaborative projects, such as those involving research teams.</p><p>An alternative hybrid model is also possible, where part of the data processing happens in the cloud while another part is handled locally. 
This approach, pioneered by <a href="http://MotherDuck.com">MotherDuck</a>, combines cloud scalability with the efficiency of local compute power.</p><h2>A Standard Format for Publishing Relational and Analytical Datasets</h2><p>The DuckDB format (<code>.duckdb</code> database file) has the potential to become the standard for sharing relational data on the internet. </p><p>Instead of exporting multiple data objects as separate CSV files, dataset publishers can export a complete dataset as a single <em>duckdb</em> file, containing all the data in an optimised, ready-to-analyse columnar format. </p><p>This approach streamlines data sharing and ensures that the data is structured for efficient querying and analysis right out of the box.</p><h2>Expand DataFrames' Footprint in Data Engineering</h2><p>The DuckDB Python package bridges SQL and Python, allowing users to query Python in-memory objects like Pandas <em><strong>DataFrames</strong></em> as if they were database tables. </p><p>This capability makes DataFrames accessible to a wider audience, including those more comfortable with SQL than Python. It also enhances the use of DataFrames for building data transformation pipelines using SQL.</p><p>By leveraging DuckDB&#8217;s SQL interface, users could even apply transformational models like <em><strong>dbt</strong></em> on top of their DataFrames, further extending the role of DataFrames in data engineering workflows.</p><h2>Turn OLTP Databases into HTAP</h2><p>The <strong><a href="https://motherduck.com/blog/pg_duckdb-postgresql-extension-for-duckdb-motherduck/">recent introduction</a></strong> of <em><strong><a href="https://github.com/duckdb/pg_duckdb">pg_duckdb</a></strong></em> and also <em><strong><a href="https://github.com/paradedb/pg_analytics">pg_analytics</a></strong></em> is another exciting development. 
</p><p>With the <em>pg_duckdb</em> Postgres extension, you can run analytical queries directly within a Postgres database, without offloading data to a separate analytical system.</p><p>With the <em>pg_analytics</em> extension, it&#8217;s possible to query datasets stored in object stores like S3, and in open table formats like Iceberg or Delta Lake, directly from Postgres, using the embedded DuckDB as the underlying query engine.</p><p>Cloud PostgreSQL providers such as <strong><a href="https://www.crunchydata.com/products/crunchy-bridge-for-analytics">Crunchy Data</a></strong> are already integrating DuckDB to extend their service offerings to OLAP and analytical use cases.</p><h2>In-Browser SQL Analytics</h2><p>In-browser analytics is becoming a reality, thanks to DuckDB&#8217;s WebAssembly implementation (<strong><a href="https://duckdb.org/2021/10/29/duckdb-wasm.html">DuckDB-WASM</a></strong>). This brings SQL-powered analysis directly into the browser, enabling data analysts to interact with datasets using SQL without installing any software. </p><p>It also offers a cost-effective solution for organisations that expose public datasets by allowing the analysis to be done client-side, reducing server load.</p><p>For instance, <strong>Hugging Face</strong> recently introduced a <strong><a href="https://huggingface.co/blog/cfahlgren1/querying-datasets-with-sql-in-the-browser">Datasets Explorer Chrome extension</a></strong>, allowing users to explore their datasets using SQL directly in the browser, powered by DuckDB&#8217;s engine under the hood. This innovation makes analysing public datasets easier and more accessible than ever.</p><div><hr></div><p>As we've explored, DuckDB is far more than just another database engine: it unlocks new possibilities for data storage, analysis, and computation. 
</p><p>If you've discovered additional capabilities or unique use cases for DuckDB, I'd love to hear your thoughts in the comments.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DS7n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" width="220" height="218.42293906810036" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:558,&quot;resizeWidth&quot;:220,&quot;bytes&quot;:136646,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/in/alirezasadeghi/&quot;,&quot;text&quot;:&quot;Follow Me on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/in/alirezasadeghi/"><span>Follow Me on 
LinkedIn</span></a></p>]]></content:encoded></item><item><title><![CDATA[Metadata Layer Architecture of Open Table Formats]]></title><description><![CDATA[Exploring the Design and Architecture of the Metadata Layer in Modern Data Lakehouse Table Formats]]></description><link>https://www.pracdata.io/p/metadata-layer-design-of-open-table-formats</link><guid isPermaLink="false">https://www.pracdata.io/p/metadata-layer-design-of-open-table-formats</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 08 Sep 2024 11:04:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cBD5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a127b5-30ec-4964-9263-f917afb11133_2168x1198.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Modern Open Table Formats, such as <strong>Hudi</strong>, <strong>Delta Lake</strong>, and <strong>Iceberg</strong>, are built on the foundation of a <strong>file-based metadata layer</strong>. In these formats, all metadata related to tables, columns, data files, and partitions is stored in metadata files alongside the actual data in data lakes. </p><p>In the previous generation of Hive-style <em>directory-oriented</em> table formats, whose evolution I covered in a <strong><a href="https://practicaldataengineering.substack.com/p/the-history-and-evolution-of-open">comprehensive essay</a></strong>, data files and partitions were tightly coupled to the underlying physical storage system API. </p><p>Query engines relied on the file system API to list data files and partitions each time a query was executed. Additionally, in systems like Hive, an external Metastore is used to keep track of metadata such as table partitions.</p><p>Modern data lakehouse engines represent an alternative mechanism for implementing table formats in cloud storage systems. 
Modern table formats maintain the entire table state within <strong>log-oriented metadata files</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cBD5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a127b5-30ec-4964-9263-f917afb11133_2168x1198.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cBD5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a127b5-30ec-4964-9263-f917afb11133_2168x1198.png 424w, https://substackcdn.com/image/fetch/$s_!cBD5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a127b5-30ec-4964-9263-f917afb11133_2168x1198.png 848w, https://substackcdn.com/image/fetch/$s_!cBD5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a127b5-30ec-4964-9263-f917afb11133_2168x1198.png 1272w, https://substackcdn.com/image/fetch/$s_!cBD5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a127b5-30ec-4964-9263-f917afb11133_2168x1198.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cBD5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a127b5-30ec-4964-9263-f917afb11133_2168x1198.png" width="1456" height="805" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54a127b5-30ec-4964-9263-f917afb11133_2168x1198.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:805,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:793423,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cBD5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a127b5-30ec-4964-9263-f917afb11133_2168x1198.png 424w, https://substackcdn.com/image/fetch/$s_!cBD5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a127b5-30ec-4964-9263-f917afb11133_2168x1198.png 848w, https://substackcdn.com/image/fetch/$s_!cBD5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a127b5-30ec-4964-9263-f917afb11133_2168x1198.png 1272w, https://substackcdn.com/image/fetch/$s_!cBD5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a127b5-30ec-4964-9263-f917afb11133_2168x1198.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>In this article we will explore the overall design and architecture of the metadata layer in these modern table formats to better understand how they are structured and managed.</p><p></p><h1>Metadata 
Structure</h1><p>The design and organisation of the metadata layer can have a significant impact on the metadata access strategy during query planning, which in turn can affect overall query performance. A straightforward approach stores all metadata for each dataset in a dedicated directory, which you can think of as a <em><strong>sub-table</strong></em> inside the main table. </p><p>This <em><strong>sub-table</strong></em>, containing the metadata records in semi-structured log files, is stored alongside the actual data files within the data lake, typically as a subdirectory directly under the root dataset path.</p><p>Let's say we have a table called <em>mytable</em>. Directly under the main path, we can store our special sub-table, called ../<em>metadata_logs</em>, as a subdirectory containing the metadata logs.</p><pre><code>/mytable/
    /metadata_logs/
        000001.log
        000002.log
        000003.log
        000004.log</code></pre><p>Together with the data files and partitions the overall structure of a simple table would look like the following:</p><pre><code>/mytable
     /ts=2024-08-01/20240801-01.parquet
                    20240801-02.parquet
                    20240801-03.parquet
     /ts=2024-08-02/20240802-01.parquet
                    20240802-02.parquet
     /metadata_logs/000001.log
                    000002.log
                    000003.log
                    000004.log</code></pre><p>At query time, the engine scans the available metadata logs stored under the <em>../metadata_logs</em> subdirectory during the query planning phase to build the list of data partitions and files to be scanned.</p><h3>How is this more efficient?</h3><p>This design fundamentally eliminates the dependency on the underlying storage metadata API for performing directory (i.e., partition) and data file listings during the query planning phase, a process that often incurs high latency and can become a significant performance bottleneck in large-scale data lakes.</p><p>Instead, the approach leverages the storage's fast sequential I/O capabilities to scan and read a few metadata log files end-to-end. This method offers significantly better performance compared to gathering file lists and statistics by issuing many <code>LIST</code> API calls. </p><p>These API calls are often subject to throttling and limits, typically returning at most 1,000 objects per call on cloud platforms. 
This can be particularly problematic when dealing with potentially hundreds of thousands of files and numerous partitions in large-scale data lakes.</p><p>In a typical <strong>Hive</strong> or <strong>Spark</strong> workload that queries a table containing a couple of years of data partitioned by date, the engine would have to make thousands of API calls to gather details of all available partitions and their data files before it can perform split planning and assign work to each worker.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wr3A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b02837-46ec-46c9-81dd-9b9e8e477090_1787x1151.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wr3A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b02837-46ec-46c9-81dd-9b9e8e477090_1787x1151.png 424w, https://substackcdn.com/image/fetch/$s_!wr3A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b02837-46ec-46c9-81dd-9b9e8e477090_1787x1151.png 848w, https://substackcdn.com/image/fetch/$s_!wr3A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b02837-46ec-46c9-81dd-9b9e8e477090_1787x1151.png 1272w, https://substackcdn.com/image/fetch/$s_!wr3A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b02837-46ec-46c9-81dd-9b9e8e477090_1787x1151.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wr3A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b02837-46ec-46c9-81dd-9b9e8e477090_1787x1151.png" width="1456" height="938" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0b02837-46ec-46c9-81dd-9b9e8e477090_1787x1151.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:938,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:546521,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wr3A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b02837-46ec-46c9-81dd-9b9e8e477090_1787x1151.png 424w, https://substackcdn.com/image/fetch/$s_!wr3A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b02837-46ec-46c9-81dd-9b9e8e477090_1787x1151.png 848w, https://substackcdn.com/image/fetch/$s_!wr3A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b02837-46ec-46c9-81dd-9b9e8e477090_1787x1151.png 1272w, https://substackcdn.com/image/fetch/$s_!wr3A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b02837-46ec-46c9-81dd-9b9e8e477090_1787x1151.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>However, if the details of all data files and partitions are stored in a few metadata log files, the engine can quickly scan the logs using fast sequential I/O to identify all the required objects, which is substantially faster.</p><p>Based on industry benchmarks, the new approach significantly reduces directory and file listing latency compared to listing directly through the underlying storage API, such as that of HDFS or S3. 
</p><p><a href="https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi">Hudi's experiments</a> show a 2-10x improvement in listing latency compared to listing directly on S3 over large datasets.</p><h2>Hierarchical Metadata Organisation</h2><p>Another way to structure the metadata layer is by using a hierarchical approach. In this method, the lower levels of the hierarchy consist of files that store metadata about specific sets of data files. As you move up the hierarchy, index files aggregate metadata from the layers below, effectively serving as a table index.</p><p>In <strong>Apache Iceberg</strong>&#8217;s layered design, at the lowest level, each metadata file (referred to as a <em><strong>manifest file</strong></em> in Iceberg terminology) tracks a subset of the available data files for a table. </p><p>In the second layer, a <em><strong>Manifest list</strong></em> file contains a high-level index of the collection of <em>manifest files</em> in the lower layer. It stores essential details such as the current snapshot ID, the locations of the manifest files, and partition boundaries.</p><p>Finally, a master <em><strong>Metadata File</strong></em> sits at the top of the hierarchy as shown below, storing a high-level snapshot view of the table's metadata. 
This file is referenced by the table's metadata pointer attribute and is regenerated whenever there is a change in the table's metadata.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zmxs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F247e4b1c-dfc4-4d90-9488-5cc39524f7e0_1466x1624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zmxs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F247e4b1c-dfc4-4d90-9488-5cc39524f7e0_1466x1624.png 424w, https://substackcdn.com/image/fetch/$s_!Zmxs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F247e4b1c-dfc4-4d90-9488-5cc39524f7e0_1466x1624.png 848w, https://substackcdn.com/image/fetch/$s_!Zmxs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F247e4b1c-dfc4-4d90-9488-5cc39524f7e0_1466x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!Zmxs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F247e4b1c-dfc4-4d90-9488-5cc39524f7e0_1466x1624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zmxs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F247e4b1c-dfc4-4d90-9488-5cc39524f7e0_1466x1624.png" width="1456" height="1613" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/247e4b1c-dfc4-4d90-9488-5cc39524f7e0_1466x1624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1613,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:329088,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zmxs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F247e4b1c-dfc4-4d90-9488-5cc39524f7e0_1466x1624.png 424w, https://substackcdn.com/image/fetch/$s_!Zmxs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F247e4b1c-dfc4-4d90-9488-5cc39524f7e0_1466x1624.png 848w, https://substackcdn.com/image/fetch/$s_!Zmxs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F247e4b1c-dfc4-4d90-9488-5cc39524f7e0_1466x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!Zmxs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F247e4b1c-dfc4-4d90-9488-5cc39524f7e0_1466x1624.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Importantly, this hierarchical structure of metadata does not need to correspond to a physical hierarchy in terms of file organisation. For example, in Iceberg, all metadata files are kept under a single metadata directory, regardless of their hierarchical relationships.</p><h3>Why would Iceberg structure the metadata in this way?</h3><p>The purpose of designing a hierarchical indexing structure in Apache Iceberg is to optimise metadata lookup performance by minimising the number of metadata files that need to be scanned during query execution. </p><p>By organising metadata in a more structured and layered manner, the system can quickly locate relevant data, leading to faster query response times. 
However, the trade-off is a more complex metadata structure compared to the simpler, flat design.</p><h1>Metadata Storage Models</h1><p>The storage model is concerned with how the metadata records are managed in the metadata log files.</p><p>An important design factor to consider is that, due to the immutable nature of the underlying storage layer, metadata update events (such as the addition or removal of data files) cannot simply be appended to the existing metadata files. As a result, each new metadata update operation must generate a new delta file. </p><p>To ensure storage and I/O efficiency, the frameworks typically perform a periodic background compaction operation. This process merges smaller delta logs into a snapshot base log. At any given point in time, the table's metadata is a combination of the last snapshot log and any new delta logs that have not yet been compacted.</p><p>There are two main storage models currently implemented by the open table formats for storing and managing a table's metadata:</p><h2>Log-structured Metadata Model</h2><p>This is the simplest implementation technique, where all metadata changes are treated as immutable, ordered events stored sequentially in transactional event logs. </p><p>This approach essentially applies the <a href="https://martinfowler.com/eaaDev/EventSourcing.html">event sourcing pattern</a> to capture all state changes at the file level, recording them in transactional logs that are stored alongside the actual data files, similar to the write-ahead log (WAL) used by database engines. </p><p>Files and partitions serve as the unit of record for which the metadata layer tracks all state changes, capturing these changes within the event log. To rebuild the current state of the table, all records in the available logs are scanned sequentially from top to bottom. 
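</p><p>As a rough illustration of that replay step, the Python sketch below rebuilds the set of active data files from an ordered list of log records. The event shape (an <code>op</code> of <code>add</code> or <code>remove</code> plus a <code>path</code>) and the function name <code>replay_logs</code> are assumptions made for this example, not the on-disk format of any particular table format.</p><pre><code>import json


def replay_logs(log_lines):
    """Rebuild the set of active data files by replaying events in order."""
    active = set()
    for line in log_lines:
        event = json.loads(line)
        if event["op"] == "add":
            active.add(event["path"])
        elif event["op"] == "remove":
            # discard() tolerates a remove for a path that was never added
            active.discard(event["path"])
        else:
            raise ValueError("unknown op: " + repr(event["op"]))
    return active


# Hypothetical log contents: one JSON event per line, oldest first.
logs = [
    '{"op": "add", "path": "ts=2024-08-01/20240801-01.parquet"}',
    '{"op": "add", "path": "ts=2024-08-01/20240801-02.parquet"}',
    '{"op": "remove", "path": "ts=2024-08-01/20240801-01.parquet"}',
    '{"op": "add", "path": "ts=2024-08-02/20240802-01.parquet"}',
]

active_files = sorted(replay_logs(logs))
print(active_files)</code></pre><p>Because events are applied strictly in order, a later remove cancels an earlier add of the same path, which is why the ordering of the log files matters. With compaction in place, the same loop could start from the latest snapshot log and then apply only the newer delta logs.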
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rrii!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fe55ba9-4c6b-4506-93a3-b5c950b6b849_1034x581.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rrii!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fe55ba9-4c6b-4506-93a3-b5c950b6b849_1034x581.png 424w, https://substackcdn.com/image/fetch/$s_!Rrii!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fe55ba9-4c6b-4506-93a3-b5c950b6b849_1034x581.png 848w, https://substackcdn.com/image/fetch/$s_!Rrii!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fe55ba9-4c6b-4506-93a3-b5c950b6b849_1034x581.png 1272w, https://substackcdn.com/image/fetch/$s_!Rrii!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fe55ba9-4c6b-4506-93a3-b5c950b6b849_1034x581.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rrii!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fe55ba9-4c6b-4506-93a3-b5c950b6b849_1034x581.png" width="1034" height="581" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4fe55ba9-4c6b-4506-93a3-b5c950b6b849_1034x581.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1034,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:109207,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rrii!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fe55ba9-4c6b-4506-93a3-b5c950b6b849_1034x581.png 424w, https://substackcdn.com/image/fetch/$s_!Rrii!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fe55ba9-4c6b-4506-93a3-b5c950b6b849_1034x581.png 848w, https://substackcdn.com/image/fetch/$s_!Rrii!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fe55ba9-4c6b-4506-93a3-b5c950b6b849_1034x581.png 1272w, https://substackcdn.com/image/fetch/$s_!Rrii!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fe55ba9-4c6b-4506-93a3-b5c950b6b849_1034x581.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Instead of using an unstructured format, semi-structured formats like JSON can be employed to provide better schema support and flexibility, while being both human-readable and easily parsed by machines. This is the approach used by <strong>Delta Lake</strong>, where transactional metadata logs are stored in JSON format.</p><h2>Table-oriented Metadata Model</h2><p>In an object or table-oriented metadata design, the framework treats table metadata&#8212;primarily dataset partitions and file listings&#8212;as a special "<em>table</em>" stored in a more structured file container, similar to how it handles the base table containing the actual data files. 
</p><p>The key difference from the framework's perspective is that this special table is used to manage metadata events rather than actual data records.</p><h3>What are the trade-offs between the two models?</h3><p>Compared to a log-structured implementation, a table-oriented metadata design is a relatively more closed and abstract approach to managing the metadata layer in open table formats. </p><p>In a log-structured approach, transactional logs are the first-class citizens of the metadata layer, directly accessible by the framework as well as other engines and users. In contrast, the table-oriented design uses a logical table as the access point for all metadata operations, abstracting the underlying metadata log files. </p><p>Existing frameworks use a combination of the two models. <strong>Apache Hudi</strong> relies heavily on optimised table-oriented metadata management using the <strong>HFile</strong> structured file format. The metadata table is internal to the framework and is largely hidden from users and clients. </p><p>On the other hand, <strong>Delta Lake</strong> and <strong>Apache</strong> <strong>Iceberg</strong> follow a more open log-oriented design, offering the additional benefit of supporting metadata streaming and event-based ingestion primitives out of the box.</p><p></p><h1>Metadata Index Files</h1><p>In this section, we will explore the format of the different metadata files used in the metadata layer, and how table events are captured and managed.</p><p></p><h2>Data Files Index</h2><p>A <strong>data file index</strong> is a type of manifest file used to maintain the list of active data files associated with a table. </p><p>As mentioned earlier, the goal is to eliminate the need for performing recursive storage metadata API calls to gather the list of data files. 
A file index can also help reduce cost on cloud object stores by significantly lowering the rate of API calls required during read and write operations.</p><p>Implementing a file index for a non-partitioned table is relatively straightforward. A sequential log file at the base table level can be used to append new filenames&#8212;with either full or relative paths&#8212;to a metadata log. During the query planning phase, the generated log files can be scanned from start to finish to identify all active files belonging to a table.</p><p>To capture storage-level or filesystem state changes, we need to consider two main filesystem object types, files and directories (i.e., partitions), with the following possible events:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S2fU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S2fU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 424w, https://substackcdn.com/image/fetch/$s_!S2fU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 848w, https://substackcdn.com/image/fetch/$s_!S2fU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 1272w, 
https://substackcdn.com/image/fetch/$s_!S2fU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S2fU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png" width="445" height="168" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:168,&quot;width&quot;:445,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png" title="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png" srcset="https://substackcdn.com/image/fetch/$s_!S2fU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 424w, 
https://substackcdn.com/image/fetch/$s_!S2fU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 848w, https://substackcdn.com/image/fetch/$s_!S2fU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 1272w, https://substackcdn.com/image/fetch/$s_!S2fU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In its simplest form, the file index can be implemented with the following fields in a WAL (write-ahead log) type file:</p><pre><code>timestamp|object|event type|value
20231015132000|partition|add|/year=2024/month=08/day=15
20231015132010|file|add|/year=2024/month=08/day=15/00001.parquet
20231015132011|file|add|/year=2024/month=08/day=15/00002.parquet
20231015132011|file|add|/year=2024/month=08/day=15/00003.parquet</code></pre><p></p><h3>Implementation of Data Files Index</h3><p><strong>Delta Lake</strong> manages file listing information at the base level of the table using a combination of a base Parquet file and additional JSON transaction log files. All file-level updates are committed to the JSON files at the time of writing. </p><p><strong>Apache Hudi</strong> maintains all file listing metadata in a single base <strong>HFile</strong>, stored under the <em>.hoodie/metadata/files</em> path. The HFile is partitioned into four sections: <em>files index</em>, <em>column stats,</em> <em>bloom filters</em>, and <em>record level index</em>. The <em><strong>Files index</strong></em> contains two types of records: one for adding new partitions and another for adding new files.</p><p>The following example adds a new partition key, 2024-08-01, with a new file, <em>file10.parquet</em>:</p><pre><code>-- Insert a new file record (Type 2):&nbsp;

{"key": "2024-08-01", "type": 2, "filenameToSizeMap": {"file10.parquet": 12345}, "isDeleted": false}

-- Insert a new partition record (Type 1):

{"key": "_all_partitions_", "type": 1, "filenameToSizeMap": {"2024-08-01": 0}, "isDeleted": false}</code></pre><h3>Why did Hudi select HFile as the metadata file format?</h3><p>The motivation behind selecting the HFile format is that even for very large datasets containing thousands of partitions and millions of files, the expected compressed and encoded HFile would be within a manageable size, typically hundreds of megabytes and less than a gigabyte. </p><p>Additionally, having fewer metadata files to scan can significantly improve performance during read operations. Consequently, Hudi selected HFile for managing both base and log files within the metadata layer. </p><p>Hudi claims that, based on <a href="https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi">experiments</a>, HFile performs better than Avro and Parquet (used by Delta and Iceberg) for point lookups (10x to 100x improvements) over a large number of metadata entries.</p><p></p><h2>Snapshot File Index</h2><p>Periodically, a <em><strong>log checkpoint</strong></em> operation is performed on a specific set of delta log files to summarise the log up to that point into a snapshot file. This process eliminates duplicate entries and retains only the latest list of active partitions and files. As a result, the checkpoint file represents the state of the table at a specific point in time. </p><p>In <strong>Delta Lake</strong>, the snapshot file is maintained in Parquet format. To compute the current table state during a query, the latest snapshot Parquet file must be read together with any subsequent JSON transaction logs to obtain the current list of active files and partitions.</p><p>Example of an &#8216;<em>add</em>&#8217; action log record in Delta:</p><pre><code>{
  "add": {
    "path": "date=2024-08-10/part-000...c000.gz.parquet",
    "partitionValues": {"date": "2024-08-10"},
    "size": 841454,
    "modificationTime": 1512909768000,
    "dataChange": true,
    "baseRowId": 4071,
    "defaultRowCommitVersion": 41,
    "stats": "{\"numRecords\":1,\"minValues\":{\"val..."
  }
}</code></pre><p></p><h2>Managing Deletes</h2><p>When files or partitions are deleted, the file index needs to track these changes and remove the corresponding listings. Due to the immutable nature of the log files, the file index cannot simply be updated to remove the entries. </p><p>A common approach to handling deletions is the use of a <em><strong>delete tombstone method</strong></em>, where a new record with a delete marker is inserted into the metadata log. This marker is then used at query time to identify and filter out the deleted files. </p><p>Additionally, a background compaction operation can be employed to clean up the file index by removing duplicate keys and older versions of records, thereby maintaining an up-to-date and efficient index.</p><p><strong>Hudi</strong> uses a special field in each record inserted into the HFile index called &#8216;<em><strong>isDeleted</strong></em>&#8217; as the delete tombstone hint:</p><pre><code>{"key": "2024-08-12", "type": 2, "filenameToSizeMap": {"file20.parquet": 65432}, "isDeleted": true}</code></pre><p><strong>Delta Lake</strong> follows a similar delete tombstone technique. However, unlike Hudi, which uses an attribute in the inserted record as a <em><strong>delete hint</strong></em>, Delta Lake commits a distinct action called &#8216;<em><strong>remove</strong></em>&#8217; to the transaction log.</p><p>During the log compaction process, previous duplicate &#8216;<em><strong>add</strong></em>&#8217; actions are eliminated, but their &#8216;<em>remove</em>&#8217; actions are retained until the retention period has expired.</p><p>Example of a &#8216;<em><strong>remove</strong></em>&#8217; action log record in Delta:</p><pre><code>{
  "remove": {
    "path": "part-00001-4.parquet",
    "deletionTimestamp": 1515488765783,
    "baseRowId": 1171,
    "defaultRowCommitVersion": 11,
    "dataChange": true
  }
}</code></pre><p>Unlike the previous two implementations, <strong>Apache Iceberg</strong> maintains a separate manifest file that contains only the records of deleted files. At read time, these delete manifest files are scanned first during query planning to identify which files to exclude.</p><p></p><h2>Column-stats Index</h2><p>An optimisation feature of open table formats is the storage of table statistics, such as column min-max values, in a separate metadata structure to facilitate faster query planning. </p><p>The goal is to eliminate the need for the conventional, expensive per-object statistics gathering during the query planning phase. Such gathering typically involves scanning the footer sections of all Parquet or ORC data files to prune irrelevant data files. This approach is not scalable when dealing with a large number of partitions and data files, as a large query would require scanning each serialised file's footer separately, resulting in numerous I/O calls. </p><p>By storing column statistics in a separate, efficient metadata index file, query engines can leverage faster and more efficient sequential I/O to perform the necessary query planning, significantly improving performance.</p><p></p><h3>Implementation of Column-Stats Index</h3><p><strong>Delta Lake</strong> supports storing per-column statistics within the same JSON-based delta log files, using a nested structure within the main data file metadata record. This structure mirrors the schema of the actual data by storing each file's minimum and maximum statistics. </p><p>For example, if the schema contains columns like <em>product</em> and <em>price</em>, the &#8216;<em><strong>stats</strong></em>&#8217; key in the metadata structure of the delta log would resemble the following format:</p><pre><code>{
  "add": {
    "path": "date=2024-08-10/part-000...c000.gz.parquet",
    "partitionValues": {"date": "2024-08-10"},
    ...
    "stats":{
         "numRecords": 10,
&#9;...
&#9;"minValues": {"product": "book", "price": 15},
&#9;"maxValues": {"product": "book", "price": 23}
    }
  }
}</code></pre><p><strong>Apache Iceberg</strong>'s design for storing both table and column-level stats is similar to Delta's: each type of statistic, such as column min and max values, is stored in a separate array-type field inside the main <em><strong>data_file</strong></em> metadata record of the manifest files.</p><p>To make the column stats object more compact, Iceberg stores the column ID as the reference instead of the column name:</p><pre><code>"value_counts": [
     {
&#9;"key": 1,
&#9;"value": 34535
     },
     {
&#9;"key": 2,
&#9;"value": 54321
     },
     {
&#9;"key": 3,
&#9;"value": 5566712
     }
]</code></pre><p>In version 0.11, <strong>Apache Hudi</strong> introduced a <em><strong>multi-modal index</strong></em>, which also stores column-level statistics within the same metadata table, specifically in the HFile metadata log. </p><p>These statistics are stored under a separate, stand-alone partition (i.e., key) called <em><strong>column_stats</strong></em>. By keeping the keys sorted, this structure leverages the locality of all records for a particular column, enhancing query performance and efficiency.</p><p>Since the statistics are stored under a separate key, each record must include the filename and path fields, which can be seen as extra overhead compared to the Delta and Iceberg implementations. The currently supported statistics include <em>minValue</em>, <em>maxValue</em>, <em>valueCount</em>, and <em>nullCount</em>. </p><p>Unlike Delta and Iceberg, which embed statistics in the data file record itself, Hudi uses an additional field, &#8216;<em>isDeleted</em>&#8217;, as a marker to indicate whether a column stats record is still valid, because the stats are stored under a separate partition. </p><h3>Advantage of combining different index types in a single object </h3><p>Using a single index file to manage both types of metadata (files and column stats) offers the advantage of reducing both metadata file management complexity and I/O overhead. Additionally, employing a single object with nested structures to store additional metadata allows for the use of a unified schema to manage the table's metadata. </p><p>This approach eliminates the need to manage a separate schema for each metadata type, simplifying overall metadata management. 
Furthermore, there is no need to separately track the statistical metadata of deleted data files, as the unified structure inherently handles this.</p><p></p><h1>Partition Management</h1><p>As open table formats represent an evolution of data lake formats by introducing a metadata abstraction layer on top, they inherently support conventional hierarchical partitioning schemes on data lakes.</p><p>Unlike traditional <em><strong>Hive-style partitioning</strong></em>, where partitions are physically represented by a directory structure, open table formats treat these directories as configurations. </p><p>These configurations provide instructions to the processing engine on how to structure the data and maintain the relationship between data files and the projected structure. The partitions themselves are stored and tracked in the metadata layer, either as a separate structure (index) or embedded within the entries of the main file index log.</p><p>By managing the mapping of files to partitions in the metadata layer, this design effectively decouples physical partitioning from logical partitioning at the table level. 
</p><p>This decoupling allows for greater flexibility, supporting partition evolution, partition consolidation, and the possibility of utilising different partition schemes within the same dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_DIs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb94ea9-aaca-4481-9ae6-ae03893220c9_2104x1195.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_DIs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb94ea9-aaca-4481-9ae6-ae03893220c9_2104x1195.png 424w, https://substackcdn.com/image/fetch/$s_!_DIs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb94ea9-aaca-4481-9ae6-ae03893220c9_2104x1195.png 848w, https://substackcdn.com/image/fetch/$s_!_DIs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb94ea9-aaca-4481-9ae6-ae03893220c9_2104x1195.png 1272w, https://substackcdn.com/image/fetch/$s_!_DIs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb94ea9-aaca-4481-9ae6-ae03893220c9_2104x1195.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_DIs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb94ea9-aaca-4481-9ae6-ae03893220c9_2104x1195.png" width="1456" height="827" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8fb94ea9-aaca-4481-9ae6-ae03893220c9_2104x1195.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:827,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:767474,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_DIs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb94ea9-aaca-4481-9ae6-ae03893220c9_2104x1195.png 424w, https://substackcdn.com/image/fetch/$s_!_DIs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb94ea9-aaca-4481-9ae6-ae03893220c9_2104x1195.png 848w, https://substackcdn.com/image/fetch/$s_!_DIs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb94ea9-aaca-4481-9ae6-ae03893220c9_2104x1195.png 1272w, https://substackcdn.com/image/fetch/$s_!_DIs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb94ea9-aaca-4481-9ae6-ae03893220c9_2104x1195.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Moreover, this approach abstracts the partitioning scheme from ETL and data ingestion pipelines. It eliminates the need for these pipelines to be aware of the table's partitioning scheme, allowing for either explicit or dynamic partitioning in SQL transformation statements. </p><p>This abstraction also mitigates some of the traditional trade-offs, such as the challenges associated with using event time for partitioning, particularly in handling late-arriving events.</p><p>Another notable technique employed by table formats like Apache Iceberg involves automatically generating and using hash-based paths in cloud object stores. 
This strategy distributes files within a partition across multiple prefixes, helping to avoid throttling and performance issues when scanning a large number of files, especially due to per-prefix call limits.</p><p></p><h3>Implementation of Partition Management</h3><p>In its simplest form, a new log entry can be added to the file index for each partition added to the base table within a log-oriented structure. This approach is akin to how <strong>Delta</strong> manages metadata updates. </p><p>Specifically, Delta captures any changes to a table's metadata&#8212;such as schema modifications and updates to the partitioning scheme&#8212;within a &#8216;<em><strong>metaData</strong></em>&#8217; action type (a <em>struct</em>) as an entry in Delta's transaction logs. The &#8216;<em><strong>partitionColumns</strong></em>&#8217; attribute within this record captures the current partition columns of the table, ensuring that the metadata reflects the most up-to-date structure.</p><p>Here&#8217;s an example of the &#8216;<em>metaData</em>&#8217; action entry in Delta&#8217;s log files:</p><pre><code>
{
  "metaData":{
    "id":"af44c3a7-abc1-4a5a-a3d9-33d65ac711ba",
    "format":{"provider":"parquet","options":{}},
    "schemaString":"...",
    "partitionColumns":["process_date"],
    "configuration":{
      "appendOnly": "true"
    }
  }
}</code></pre><p>In an <em>object-oriented</em> metadata structure, all partitions can be encapsulated within their own object (<em>struct</em>) in the main index file. This method is employed by <strong>Apache Hudi</strong>, where table partitions are tracked by inserting a new log entry into the base HFile index with a specific key, &#8216;<em><strong>_all_partitions_</strong></em>&#8217;. </p><p>This key aggregates all partitions within a single object, making it easier to manage and track the partitions at the table level. By using this approach, Apache Hudi efficiently maintains an up-to-date record of all partitions within the dataset.</p><pre><code>{
   "key": "_all_partitions_",
   "type": 1,
   "filenameToSizeMap": {"2024-08-10": 0},
   "isDeleted": false
}</code></pre><p><strong>Apache Iceberg</strong> also stores partitioning details in the top-level metadata JSON file under its own &#8216;<em><strong>partition-specs</strong></em>&#8217; key. This key encapsulates the partitioning scheme used by the table, detailing how data is logically divided. </p><pre><code>{
    ...
    "partition-specs" : [ {
        "spec-id" : 0,
        "fields" : [ {
            "name" : "process_date",
            "transform" : "identity",
            "source-id" : 2,
            "field-id" : 1000
        } ]
    } ],
    ...
}</code></pre><p>Apache Iceberg goes a step further by storing additional partition statistics such as record count, file count, total size, and delete record count within its metadata. These statistics can significantly enhance query planning and cost-based query optimisation. </p><p>Additionally, these statistics are used to facilitate Dynamic Partition Pruning, a technique that reduces the amount of data scanned during queries by eliminating irrelevant partitions early in the query execution process. </p><div><hr></div><p>What has been presented here forms the foundation of the architecture behind the metadata layer in modern open table formats. However, there are many additional details and advanced features that these systems offer. For those interested in exploring further, the official documentation of the covered projects is a good reference.</p><p></p>]]></content:encoded></item><item><title><![CDATA[DLD #2 | Data Landscape Digest 🗞️]]></title><description><![CDATA[Curated Knowledge on Data Engineering Landscape]]></description><link>https://www.pracdata.io/p/dld-2-data-landscape-digest</link><guid isPermaLink="false">https://www.pracdata.io/p/dld-2-data-landscape-digest</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sat, 31 Aug 2024 16:16:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KLRn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KLRn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KLRn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 
848w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KLRn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg" width="1456" height="1020" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KLRn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!KLRn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div><hr></div><h2>&#10024; Featured - Netflix Maestro Workflow Engine</h2><div><hr></div><p>The crowded field of 
open-source engines has just welcomed a new player: <strong>Maestro</strong>, recently open-sourced by <strong>Netflix</strong>!</p><p>Netflix asserts that Maestro is a highly scalable and flexible scheduler capable of managing large-scale heterogeneous workflows, including ML training and data pipelines. It supports flexible execution logic, such as Docker images and notebooks, and accommodates various workflow patterns, including cyclic and acyclic (DAGs). </p><p>One of its standout features is the <em><strong>foreach</strong></em><strong> pattern</strong>, which is particularly useful for repetitive tasks like ML model training and data backfilling&#8212;something that would typically require separate job runs on a scheduler like <strong>Airflow</strong> to backfill daily ingested source data. Maestro also offers multiple <strong>domain-specific languages (DSLs)</strong> for defining workflows declaratively using YAML files, a feature that would need to be custom-built on top of Airflow.</p><p>Since its open-source release in July, the project has already garnered 3,000 stars on <a href="https://github.com/Netflix/maestro">GitHub</a>. 
Netflix had <strong><a href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c">previously covered</a></strong> the internals and use cases implemented with this engine, and their <strong><a href="https://netflixtechblog.com/maestro-netflixs-workflow-orchestrator-ee13a06f9c78">latest blog post</a></strong> provides a comprehensive overview of Maestro's features and supported workflow patterns.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8N2F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8N2F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 424w, https://substackcdn.com/image/fetch/$s_!8N2F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 848w, https://substackcdn.com/image/fetch/$s_!8N2F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 1272w, https://substackcdn.com/image/fetch/$s_!8N2F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8N2F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png" width="510" height="367.78846153846155" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1050,&quot;width&quot;:1456,&quot;resizeWidth&quot;:510,&quot;bytes&quot;:351926,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8N2F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 424w, https://substackcdn.com/image/fetch/$s_!8N2F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 848w, https://substackcdn.com/image/fetch/$s_!8N2F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 1272w, https://substackcdn.com/image/fetch/$s_!8N2F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>Additionally, this <strong><a href="https://blog.det.life/netflix-maestro-and-apache-airflow-competitors-or-companions-in-workflow-orchestration-2bce948956a5">blog post</a></strong> provides a thorough comparison between <strong>Airflow</strong> and <strong>Maestro</strong>, complete with practical examples and code snippets. </p><p></p><div><hr></div><h2>&#128161; Trends &amp; Insight</h2><div><hr></div><h4>&#128073; The State of Modern Data Stack in 2024</h4><p>In 2024, there has been much discussion about the decline of the <strong>Modern Data Stack (MDS)</strong>. Concerns have been raised about the economics of the Modern Data Stack, and the term itself is being recycled, much like previously hyped concepts such as "Big Data". 
Some experts believe that many MDS startups are <strong><a href="https://joereis.substack.com/p/everything-ends-my-journey-with-the">doomed to extinction</a></strong>. As <strong><a href="https://mattturck.com/mad2024/">Matt Turck</a></strong> pointed out, the Modern Data Stack was largely a <em>marketing concept and an alliance among several startups across the data value chain</em>. In a recent blog post, <strong>Ananth</strong> explores the history and decline of the MDS and the emerging <strong>post-MDS</strong> era. <strong><a href="https://www.dataengineeringweekly.com/p/a-brief-history-of-modern-data-stack">&#8212;&gt; Read more</a></strong></p><h4>&#128073; Embracing '<em>Bring Your Own Compute</em>' with DuckDB</h4><p>An interesting discussion took place between <strong>MotherDuck</strong>&#8217;s co-founder and CEO and <strong>Fivetran</strong>&#8217;s co-founder and CEO about the future of big data and single-node or laptop-sized analytics using <strong>DuckDB</strong>. With advancements in hardware, we might witness a new shift towards <em>local execution</em>, where computing is fully or partially pushed to the user&#8217;s machine, introducing the concept of <em><strong>Bring Your Own Compute</strong></em> for analytics. <strong><a href="https://www.fivetran.com/blog/the-future-of-big-data-processing-may-be-laptop-sized">&#8212;&gt; Watch the interview</a></strong></p><div><hr></div><h2>&#128225; Open Source News</h2><div><hr></div><h4>&#128073;  Apache Kafka 3.8 Release</h4><p><strong>Apache Kafka 3.8</strong> has been released. The Confluent blog provides a summary of the new features and improvements in this release. 
<strong><a href="https://www.confluent.io/blog/introducing-apache-kafka-3-8/">&#8212;&gt; Read more</a></strong></p><h4>&#128073;  Delta Lake 4.0 New Features</h4><p>The <strong>Delta Lake 4.0</strong> Preview was <strong><a href="https://delta.io/blog/delta-lake-4-0/">announced</a></strong> in June. A blog post highlights the important new features of this release, such as Change Data Feed (CDF), Liquid Clustering, and new monitoring features, with some practical examples. <strong><a href="https://medium.com/@vishalbarvaliya/delta-lake-4-0-a-simple-guide-842c3afbcd06">&#8212;&gt; Read more</a></strong></p><h4>&#128073; DAG Factory Project Takeover by Astronomer</h4><p><strong>Astronomer</strong> has announced that it is taking over the open-source project <strong><a href="https://github.com/astronomer/dag-factory">DAG Factory</a></strong>, a Python library for authoring Airflow DAGs declaratively using YAML configuration files. Providing a thin no-code abstraction layer on top of Airflow has become a common practice among tech companies to reduce the engineering effort required to create DAGs, standardise pipeline creation for common use cases like data transformation, and make the process more self-service. <strong><a href="https://www.astronomer.io/blog/astronomer-adopts-dag-factory-democratize-writing-data-pipelines/">&#8212;&gt; Read more</a></strong></p><div><hr></div><h2>&#128736; Practical Data Engineering</h2><div><hr></div><h4>&#128073; Consistent Data Modeling and Naming Conventions</h4><p>Implementing a consistent data modeling framework, including standardised naming conventions, is crucial to maintaining a healthy data platform and ensuring long-term scalability, even when there is turnover among engineers. Mike discusses some of the key aspects and best practices for data warehouse modeling, including effective naming conventions for tables and schemas. 
<strong><a href="https://towardsdatascience.com/advanced-data-modelling-1e496578bc91">&#8212;&gt; Read more</a></strong></p><h4>&#128073; State of CI/CD for Data Pipelines</h4><p><strong>LakeFS</strong> published a comprehensive overview of implementing Continuous Integration/Continuous Delivery (CI/CD) for data pipelines, focusing on the <strong>Write-Audit-Publish (WAP)</strong> ingestion pattern. The article explores various options and tools available in the market, offering insights into how to effectively integrate CI/CD practices into data workflows. <strong><a href="https://lakefs.io/blog/cicd-pipeline-guide/">--&gt; Read more</a></strong></p><h4> &#128073; Data Reconciliation Techniques and Best Practices</h4><p><strong>Datafold</strong> has published a three-part series on <strong>data reconciliation</strong>, a crucial subset of data quality. The series covers use cases, techniques, challenges, and best practices for performing data reconciliation across data sources and targets, with the goal of ensuring data accuracy and completeness. <strong><a href="https://www.datafold.com/blog/what-is-data-reconciliation">&#8212;&gt; Read more</a></strong></p><h4>&#128073;  dbt Beyond the Marketing Hype</h4><p>There have been many blog posts and discussions about the hype <strong>dbt</strong> has generated over the past few years. The author of this <strong><a href="https://blog.det.life/no-data-engineers-dont-need-dbt-30573eafa15e">blog post</a></strong> takes a different approach, discussing the challenges of performing data transformation in data warehouses and how dbt can address them. It starts with real problems and then explores how tooling, specifically dbt, provides solutions&#8212;rather than starting with the tool (because it's popular and everyone is talking about it) and then searching for problems it can solve. 
The post also offers a clear definition of what dbt actually does:</p><blockquote><p> <strong>dbt works by abstracting common data-warehouse patterns into config-driven automation and providing a suite of tools to simplify SQL transformations, tests, and documentation.</strong></p></blockquote><p></p><div><hr></div><h2>&#9881;&#65039; Technical Deep Dive</h2><div><hr></div><h4>&#128073; Evolution of Debezium's Internal Engine</h4><p>Although <strong>Debezium</strong> plugins are primarily used within the Kafka Connect framework and runtime in many streaming CDC architectures, it is also possible to run Debezium connectors outside the Kafka ecosystem. This can be done by embedding the Debezium engine in internal applications or by using the standalone <strong>Debezium Server</strong>, which is now a separate project on <strong><a href="https://github.com/debezium/debezium-server/">GitHub</a></strong>. This Debezium blog post traces the evolution of Debezium's internal engine, from the initial <code>EmbeddedEngine</code> implementation, which was mainly built for testing, to the new <code>AsyncEmbeddedEngine</code>, which addresses the shortcomings of the previous implementation. <strong><a href="https://debezium.io/blog/2024/07/08/async-embedded-engine/">&#8212;&gt; Read more</a></strong></p><h4>&#128073; A Guide to Concurrency Levels in Apache Airflow</h4><p>One of the most confusing aspects of the <strong>Apache</strong> <strong>Airflow</strong> engine, especially for newcomers, is how concurrency is applied at different levels, such as the Airflow scheduler, DAG level, and task level, and how their combination can impact overall workflow performance. The configuration parameters in early releases added to this confusion, with names that often seemed unrelated to what they controlled. 
In a recent blog post, <strong>Google</strong> provides a comprehensive overview of the various concurrency levels in Airflow, with a particular focus on its managed Airflow service, Cloud Composer. <strong><a href="https://cloud.google.com/blog/products/data-analytics/airflow-dag-and-task-concurrency-in-cloud-compose">&#8212;&gt; Read more</a></strong></p><h4>&#128073;  Apache Airflow Software Architecture</h4><p>A helpful guide on the Apache Airflow blog uses visuals to illustrate the key underlying components of the Apache Airflow software architecture and how they interact within the system. <strong><a href="https://medium.com/apache-airflow/airflow-architecture-simplified-3d582fc3ccb0">&#8212;&gt; Read more</a></strong></p><p>Speaking of Airflow, do you...!?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!COx4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!COx4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 424w, https://substackcdn.com/image/fetch/$s_!COx4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 848w, https://substackcdn.com/image/fetch/$s_!COx4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 1272w, 
https://substackcdn.com/image/fetch/$s_!COx4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!COx4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png" width="640" height="360" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:360,&quot;width&quot;:640,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24060,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!COx4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 424w, https://substackcdn.com/image/fetch/$s_!COx4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 848w, https://substackcdn.com/image/fetch/$s_!COx4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 1272w, 
https://substackcdn.com/image/fetch/$s_!COx4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h4>&#128073; Bringing GenAI and LLMs to Flink Streaming Pipelines</h4><p>The <strong>Confluent</strong> blog provides an overview of a new <strong>Flink AI</strong> feature, which allows streaming data pipelines to invoke AI models, including generative AI (GenAI) large language model (LLM) endpoints (such as OpenAI and Google Vertex AI), directly from Flink 
SQL statements. This enables tasks like AI model inference, regression, and classification to be seamlessly integrated into real-time data processing workflows. <strong><a href="https://www.confluent.io/blog/flinkai-realtime-ml-and-genai-confluent-cloud/">&#8212;&gt; Read more</a></strong></p><div><hr></div><h2>&#128270; Case Studies</h2><div><hr></div><h4>&#128073; Pinterest's Migration from HBase to TiDB</h4><p><strong>Pinterest</strong> shared their journey of replacing <strong>HBase</strong> storage with a modern, scalable open-source database system that meets their requirements for reliability, performance, tunable consistency, and robust CDC support. They ultimately chose <strong>TiDB</strong> as the solution. We are seeing more stories of companies exploring alternatives to HBase due to its limitations and maintenance overhead. <strong><a href="https://medium.com/pinterest-engineering/tidb-adoption-at-pinterest-1130ab787a10">&#8212;&gt; Read more</a></strong></p><h4>&#128073; Slack&#8217;s Migration to EMR 6</h4><p><strong>Slack</strong> discusses their migration from EMR 5 with Spark 2 to EMR 6 with Hive 3 and Spark 3 on AWS, highlighting the performance and reliability improvements achieved in their data pipelines, which are developed using Apache Spark and scheduled on Airflow. <strong><a href="https://slack.engineering/unlocking-efficiency-and-performance-navigating-the-spark-3-and-emr-6-upgrade-journey-at-slack/">&#8212;&gt; Read more</a></strong></p><p></p><div><hr></div><h2>&#128172; Community Discussions</h2><div><hr></div><p>There was a recent <strong><a href="https://www.reddit.com/r/dataengineering/comments/1eajbke/how_fast_data_engineering_is_moving_forward/">discussion on Reddit</a></strong> on <strong>how fast data engineering is progressing</strong>. The consensus among most commenters is that while tools, storage systems, and processing frameworks may evolve rapidly, the fundamentals of data engineering remain consistent. 
These fundamentals include data integration, data modeling, and the processes of extracting, transforming, and loading data (ETL).</p><p>For <strong>aspiring data engineers</strong>, the key takeaway is to invest time and effort in mastering these fundamentals rather than focusing solely on becoming an expert in specific tools. While vendors are striving to automate the data engineering lifecycle as much as possible (as seen with the latest Databricks offerings), a strong understanding of the fundamentals will always be valuable and help you stand out in the field.</p><p></p><div><hr></div><h2>&#128227; Vendors News &amp; Announcements</h2><div><hr></div><h4>&#128073; Snowflake's New Cortex Search Feature</h4><p>In July 2024, <strong>Snowflake</strong> announced a new feature called <strong>Cortex Search</strong> (currently in Public Preview). This search service is designed for unstructured data, such as text, and enables enterprises to deploy <strong>Retrieval-Augmented Generation (RAG)</strong> applications using Snowflake, allowing them to customise generative AI applications with proprietary data. <strong><a href="https://www.snowflake.com/en/blog/cortex-search-ai-hybrid-search/">&#8212;&gt; Read more</a></strong></p><h4>&#128073;  Databricks Mosaic AI Model Training</h4><p>Around the same time, <strong>Databricks</strong> announced support for <strong>Mosaic AI Model Training</strong>, which streamlines the fine-tuning of general-purpose open-source LLM and GenAI models, such as Llama 3 and Mistral, using enterprise data. Databricks recommends a new approach for training LLMs with enterprise data called <a href="https://arxiv.org/abs/2403.10131">Retrieval Augmented Fine-tuning (RAFT)</a>, which combines Retrieval-Augmented Generation (RAG) with model fine-tuning. 
<strong><a href="https://www.databricks.com/blog/introducing-mosaic-ai-model-training-fine-tuning-genai-models">&#8212;&gt; Read more</a></strong></p><h4>&#128073; Release of Confluent Platform 7.7</h4><p><strong>Confluent</strong> announced the release of Confluent Platform 7.7, built on Apache Kafka 3.7. This update introduces significant features, including <strong>Confluent Platform for Apache Flink</strong>, a fully managed and serverless stream processing service (currently in Limited Availability), as well as a self-managed HTTP Source connector for ingesting data from external APIs. <strong><a href="https://www.confluent.io/blog/introducing-confluent-platform-7-7/">&#8212;&gt; Read more</a></strong></p><p></p>]]></content:encoded></item><item><title><![CDATA[DLD #1 | Data Landscape Digest 🗞️]]></title><description><![CDATA[Curated Knowledge on Data Engineering Landscape]]></description><link>https://www.pracdata.io/p/dld-1-data-landscape-digest</link><guid isPermaLink="false">https://www.pracdata.io/p/dld-1-data-landscape-digest</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 25 Aug 2024 12:26:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KLRn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KLRn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KLRn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 
848w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KLRn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg" width="1456" height="1020" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KLRn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!KLRn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><h2>Introduction</h2><p>Welcome to the <strong>Data Landscape Digest (DLD)</strong> newsletter series. 
This will be a periodic roundup of the latest and greatest in the world of data, and data engineering in particular. We'll deliver a curated selection of the finest news articles, blog posts, tutorials, discussions, and more from across the data landscape.</p><div><hr></div><h2>&#10024; Featured</h2><div><hr></div><p>In a recent <strong><a href="https://practicaldataengineering.substack.com/i/147610179/non-open-vs-open-data-lakehouse">blog post</a></strong>, I explored the difference between generic <strong>data lakehouses</strong> offered by some vendors and <strong>open data lakehouses</strong> built on the foundation of open-source tools and technologies. The Apache Hudi blog published a post providing an overview of the <strong>open data lakehouse architecture</strong>. It details the architecture layers, components, key technologies, advantages over previous data lake architectures, and use cases for implementing an open data lakehouse. <strong><a href="https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/">&#8212;&gt; Read more</a></strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NMyZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NMyZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 424w, https://substackcdn.com/image/fetch/$s_!NMyZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 848w, 
https://substackcdn.com/image/fetch/$s_!NMyZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 1272w, https://substackcdn.com/image/fetch/$s_!NMyZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NMyZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png" width="1456" height="789" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:789,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;/assets/images/blog/dlh_new.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="/assets/images/blog/dlh_new.png" title="/assets/images/blog/dlh_new.png" srcset="https://substackcdn.com/image/fetch/$s_!NMyZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 424w, https://substackcdn.com/image/fetch/$s_!NMyZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 848w, 
https://substackcdn.com/image/fetch/$s_!NMyZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 1272w, https://substackcdn.com/image/fetch/$s_!NMyZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 1456w" sizes="100vw"></picture></div></a></figure></div><div><hr></div><h2>&#128225; Open Source News</h2><div><hr></div><h4>&#128073;  Debezium Latest Official Release</h4><p><strong>Debezium 
2.7.0.Final</strong> has been released. Some 140 outstanding issues have been fixed, along with many new features and improvements in the core component as well as the stand-alone connectors.  <strong><a href="https://debezium.io/blog/2024/07/01/debezium-2-7-final-released/">&#8212;&gt; Read more</a></strong></p><h4>&#128073;  Submission of OneTable to the Apache Software Foundation</h4><p>The news from <strong>OneHouse</strong> that <strong>XTable</strong> (formerly known as OneTable) has been submitted to the Apache Software Foundation Incubator was a major recent announcement. Top cloud vendors such as Microsoft and Google are already integrating XTable into their analytics platforms, <strong>Microsoft Fabric</strong> and <strong>BigLake</strong> respectively, to provide a unified logical lakehouse with interoperability between different open table formats. <strong><a href="https://www.onehouse.ai/blog/open-data-foundations-with-apache-xtable-hudi-delta-and-iceberg-interoperability">&#8212;&gt; Read more</a></strong></p><div><hr></div><h2>&#128736; Practical Data Engineering</h2><div><hr></div><h4>&#128073; Spark Repartition() Function</h4><p>Spark pipelines often employ the <code>df.repartition()</code> function to optimise data processing, especially by consolidating small partitions before loading data into target storage. It's essential to remember that a partition in Spark is a <em>unit of parallelism</em> and data distribution used by the distributed compute engine, so repartitioning is not necessarily a full bucketing or SQL-style <code>group by</code> operation. A blog post explains how Spark repartitioning works and what it actually does. <strong><a href="https://python.plainenglish.io/the-truth-about-pysparks-repartition-prepare-to-be-surprised-4dede792f3f4">--&gt; Read more</a></strong></p><h4>&#128073; Apache XTable + Airflow</h4><p><strong>AWS</strong> published a blog post demonstrating how to use <strong>Apache XTable</strong> to convert open table format metadata to other formats. 
The blog post features a custom Airflow operator, <code>XtableOperator()</code>, designed for batch pipeline translations on the AWS platform. The operator's code is available on <strong><a href="https://github.com/aws-samples/apache-xtable-on-aws-samples">GitHub</a></strong>. This development suggests that <em>unified open table format</em> adoption is gaining some momentum. <strong><a href="https://aws.amazon.com/blogs/big-data/run-apache-xtable-on-amazon-mwaa-to-translate-open-table-formats/">&#8212;&gt; Read more</a></strong></p><div><hr></div><h2>&#9881;&#65039; Technical Deep Dive</h2><div><hr></div><h4>&#128073;  Apache Paimon&#8217;s Internal Design</h4><p>While the 'big three' open table formats &#8211; Apache Hudi, Iceberg, and Delta Lake &#8211; dominate the market and discussions, <strong>Apache Paimon</strong>, a more recent &#8220;<strong>Flink table format</strong>&#8221;, has received less attention. If you're interested in learning more about Apache Paimon, a comprehensive blog post by Giannis delves into its design goals, internals, and key features. <strong><a href="https://medium.com/@ipolyzos_/the-majesty-of-apache-flink-and-paimon-d36e73571fc9">&#8212;&gt; Read more</a></strong></p><p></p><h4> &#128073;  How dlt Works Under the Hood</h4><p>Discussions have been ongoing regarding the potential use of new open source data ingestion tools like <strong>Data Load Tool (dlt)</strong> as a replacement for more established ones like <strong>Airbyte</strong> in certain use cases (e.g. API data integration). A dlt project contributor has published a blog post detailing the internal data pipeline design and core functions of dlt, including data extraction, normalisation, and loading. The latest version leverages Apache Arrow's efficient in-memory data structure to optimise the entire pipeline. 
<strong><a href="https://dlthub.com/blog/how-dlt-uses-apache-arrow">&#8212;&gt; Read more</a></strong></p><p></p><h4>&#128073;  Yet Another Kafka Explanation</h4><p>There's a wealth of resources available explaining Kafka's architecture and internals. I found a recent blog post series on the topic, which provides clear and concise explanations of the concepts, accompanied by helpful visuals for those unfamiliar with Kafka's design and operation.</p><p><strong><a href="https://blog.det.life/apache-kafka-overview-b04c4ab8ef49">Kafka Architecture Overview</a></strong> | <strong><a href="https://blog.det.life/apache-kafka-important-designs-2a0e6aa6c5bf">Design elements</a></strong> | <strong><a href="https://blog.det.life/apache-kafka-producer-db3b177f65d2">Kafka Producer</a></strong> | <strong><a href="https://blog.det.life/apache-kafka-consumer-d902e3589679">Kafka Consumer</a></strong></p><p></p><h4>&#128073; DuckDB's Internal Memory and Buffer Management</h4><p>If you've used <strong>DuckDB</strong> or are exploring its capabilities, you might wonder how it handles large datasets without memory limitations, a common issue with some Python dataframes like Pandas. DuckDB&#8217;s official blog has recently covered the engine's internal memory and buffer management. It explains how DuckDB leverages streaming execution to process queries without fully loading CSV or Parquet files into memory, and utilises disk spilling when intermediate results exceed memory capacity. <strong><a href="https://duckdb.org/2024/07/09/memory-management.html">&#8212;&gt; Read more</a></strong></p><p></p><h4>&#128073;  Snowflake's Micro Partitioning Internal Design</h4><p>If you've been using <strong>Snowflake</strong> at your company, you're likely familiar with its internal partitioning feature called <em><strong><a href="https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions">micro-partitioning</a></strong></em>. 
This process automatically divides tables into micro-partitions, each holding between 50 MB and 500 MB of uncompressed data, and organises the data in a columnar format within each micro-partition. A concise and excellent blog post provides a clear explanation of micro-partitioning's internal design, complete with helpful visuals. <strong><a href="https://medium.com/@saisuman.singamsetty/snowflake-micro-partitions-the-future-of-data-storage-and-retrieval-4e41312708b4">&#8212;&gt; Read more</a></strong></p><p></p><h4> &#128073; Overview of Kafka's Tiered Storage Design</h4><p><strong>Kafka 3.6</strong>, released in 2023, introduced a highly anticipated feature: <strong>Tiered Storage</strong>. This feature currently supports <em>local</em> and <em>remote</em> storage tiers, enabling the movement of inactive segments to a configurable deep storage solution like HDFS or S3 based on local retention settings. This provides a cost-effective and scalable way to retain historical data. <strong>Uber</strong>, which is credited with driving the tiered storage proposal <strong><a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage">[KIP-405]</a></strong>, discusses the internals of the tiered storage architecture in a blog post. <strong><a href="https://www.uber.com/en-AU/blog/kafka-tiered-storage/">&#8212;&gt; Read more</a></strong></p><div><hr></div><h2>&#128270; Case Studies</h2><div><hr></div><h4> &#128073;  Implementation of dlt in Production</h4><p>It's inspiring to hear about individuals and teams embracing new open-source tools and technologies in production environments. <strong>Data Load Tool (dlt)</strong>, the relatively new ETL library covered earlier, has been adopted by some teams for production workloads. Alexander from Dataops explores the advantages and disadvantages of using dlt compared to more established data integration tools like Airbyte. 
<strong><a href="https://medium.com/dataops-tech/data-load-tool-dlt-pros-cons-and-integration-into-data-platform-as-an-ingest-tool-eb34311a2007">&#8212;&gt; Read more</a></strong></p><h4>&#128073;  Notion's Data Architecture Evolution</h4><p><strong>Notion</strong> has unveiled their new data lakehouse architecture. They chose Apache Hudi as their table format due to its efficient incremental data ingestion capabilities, making it suitable for their update-heavy workloads. The architecture also incorporates event-based CDC ingestion using Debezium and Kafka. <strong><a href="https://www.notion.so/blog/building-and-scaling-notions-data-lake">&#8212;&gt; Read more</a></strong></p><div><hr></div><h2>&#128172; Community Discussions</h2><div><hr></div><p>&#128073; This <strong><a href="https://www.reddit.com/r/dataengineering/comments/1e7fcmx/what_i_would_do_if_had_to_relearn_data/">career advice</a></strong> from a senior data engineer highlights a key point in one of Reddit's highest-rated data engineering discussions in July: </p><div class="pullquote"><p><strong>Master the data engineering fundamentals first!</strong></p></div><p>While flashy tools and platforms come and go, a strong foundation in low-level skills like Bash, Git, SQL, pure Python development, and containerisation will take you much further. Juniors who prioritise these foundational skills before diving into advanced tools and stacks will be better-positioned for success.</p><div><hr></div><h2> &#127909; Conferences &amp; Events</h2><div><hr></div><p>&#128073;  The annual virtual <strong>PrestoCon 2024</strong> day organised by Linux Foundation/Presto Foundation took place in June, discussing topics like Presto 2.0 native C++ engine, and Presto usage at companies like Uber. A recap of the event and the main sessions is provided in this <strong><a href="https://prestodb.io/blog/2024/07/02/recap-of-prestocon-day-2024-presto-c-performance-new-connectors-use-cases-and-so-much-more/">article</a></strong>. 
All 24 recorded sessions can be found on <strong><a href="https://www.youtube.com/playlist?list=PLJVeO1NMmyqUUj2UbRiwX8-Pmc7RNwDcY">YouTube</a></strong>.</p><p></p>]]></content:encoded></item><item><title><![CDATA[The History and Evolution of Open Table Formats - Part II]]></title><description><![CDATA[From Hive to High Performance: A Journey Through the Evolution of Data Management on Data Lakes]]></description><link>https://www.pracdata.io/p/the-history-and-evolution-of-open-14d</link><guid isPermaLink="false">https://www.pracdata.io/p/the-history-and-evolution-of-open-14d</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 18 Aug 2024 08:42:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Z4uZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34374137-353a-4960-aa19-1b3f52db64b6_4247x1639.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Z4uZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34374137-353a-4960-aa19-1b3f52db64b6_4247x1639.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z4uZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34374137-353a-4960-aa19-1b3f52db64b6_4247x1639.png 424w, https://substackcdn.com/image/fetch/$s_!Z4uZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34374137-353a-4960-aa19-1b3f52db64b6_4247x1639.png 848w, https://substackcdn.com/image/fetch/$s_!Z4uZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34374137-353a-4960-aa19-1b3f52db64b6_4247x1639.png 1272w, https://substackcdn.com/image/fetch/$s_!Z4uZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34374137-353a-4960-aa19-1b3f52db64b6_4247x1639.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z4uZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34374137-353a-4960-aa19-1b3f52db64b6_4247x1639.png" width="1456" height="562" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/34374137-353a-4960-aa19-1b3f52db64b6_4247x1639.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:307891,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z4uZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34374137-353a-4960-aa19-1b3f52db64b6_4247x1639.png 424w, https://substackcdn.com/image/fetch/$s_!Z4uZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34374137-353a-4960-aa19-1b3f52db64b6_4247x1639.png 848w, https://substackcdn.com/image/fetch/$s_!Z4uZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34374137-353a-4960-aa19-1b3f52db64b6_4247x1639.png 1272w, https://substackcdn.com/image/fetch/$s_!Z4uZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34374137-353a-4960-aa19-1b3f52db64b6_4247x1639.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In <strong><a href="https://practicaldataengineering.substack.com/p/the-history-and-evolution-of-open">Part I</a></strong> we went over the origin and architecture of traditional table management systems and the first generation of Open Table Formats (OTF). 
In this final part, I will discuss the second- and third-generation OTFs.</p><p></p><h1>2nd Generation OTF - The Rise of Log-oriented Table Formats</h1><p>Now that we have built a strong case for re-imagining and improving the open table format model, let's recap and list the major issues we identified with the previous generation of table formats:</p><ul><li><p>Tight coupling between physical partitioning and the logical partitioning scheme of the data.</p></li><li><p>Heavy reliance on the file system or object store API for listing files and directories during the query planning phase.</p></li><li><p>Reliance on an external metadata store for maintaining table-level information such as schemas, partitions, and column-level statistics.</p></li><li><p>Lack of support for record-level upsert, merge and delete.</p></li><li><p>Lack of ACID and transactional properties.</p></li></ul><p>Let's temporarily set aside the complexities of upsert and ACID transactions to focus on the first three fundamental challenges. Given these constraints, we must consider how to decouple partitioning schemes from physical file layouts, minimise file system API calls for file and partition listings, and eliminate the reliance on an external metadata store. </p><p>To address these requirements, we need a data structure capable of efficiently storing metadata about data, partitions, and file listings. This structure must be <strong>fast, scalable, and self-contained, with no dependencies on external systems.</strong></p><p>One solution to address these requirements is surprisingly simple, though not always the most obvious. 
<strong>Jay Kreps</strong> and the engineering team at LinkedIn built <strong>Apache Kafka</strong> on the foundation of a simple append-only storage abstraction: an <strong><a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying">immutable log</a></strong> containing sequential records of events ordered by time. Can we use a similar foundation here? </p><p><strong>So the question is:</strong></p><blockquote><p>If <strong>immutable logs</strong> can store events representing facts that always remain true, effectively capturing the evolution of an application's state over time in systems like Apache Kafka, can't we apply the same basic principles to manage the state of a table's metadata in our case?</p></blockquote><p>By leveraging log files, we can treat all metadata modifications as immutable, sequentially ordered events. This aligns with the <a href="https://martinfowler.com/eaaDev/EventSourcing.html">Event Sourcing</a> data modeling paradigm, where we capture state changes at the partition and file level within transactional logs stored alongside the data. </p><p>Files and partitions become the units of record for which the metadata layer tracks all the state changes in the log. In this design, <strong>the metadata logs are the first-class citizens of the metadata layer.</strong></p><p></p><h2>Let's build a Simple Log-oriented Table</h2><p>Let&#8217;s do a quick practical exercise to understand how we can design our new table format to capture and organise the metadata in log files.</p><p>In this exercise we will build a simple log-oriented metadata table format that captures file system and storage-level state changes, such as adding and removing files and partitions, and provides event log primitives such as strong ordering, versioning, time travel, and replaying events to rebuild the state. </p><p>To capture storage-level or file system state changes, we need to consider two main file system objects, namely files and directories (i.e. partitions), with the following possible events:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S2fU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S2fU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 424w, https://substackcdn.com/image/fetch/$s_!S2fU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 848w, 
https://substackcdn.com/image/fetch/$s_!S2fU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 1272w, https://substackcdn.com/image/fetch/$s_!S2fU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S2fU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png" width="445" height="168" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:168,&quot;width&quot;:445,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24480,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S2fU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 424w, https://substackcdn.com/image/fetch/$s_!S2fU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 848w, 
https://substackcdn.com/image/fetch/$s_!S2fU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 1272w, https://substackcdn.com/image/fetch/$s_!S2fU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487a67c0-0a0e-4ca5-8ce0-2f90ecd02561_445x168.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>*Renames can be treated as two events, a removal and an addition</em></figcaption></figure></div><p>Let&#8217;s assume a particular table is partitioned on three levels in <code>/year=/month=/day=</code> format. In its simplest form, the metadata log can be implemented with the following fields in an immutable log file:</p><pre><code><code>timestamp|object|event type|value
20231015132000|partition|add|/year=2023/month=10/day=15
20231015132010|file|add|/year=2023/month=10/day=15/00001.parquet
20231015132011|file|add|/year=2023/month=10/day=15/00002.parquet
20231015132011|file|add|/year=2023/month=10/day=15/00003.parquet</code></code></pre><p>Later if a file is removed, a new remove event can be captured at the end of the log file:</p><pre><code>20231015132011|file|remove|/year=2023/month=10/day=15/00003.parquet</code></pre><p></p><h3>Managing Metadata Updates</h3><p>Given the immutable nature of data lake storage systems like HDFS or object stores, metadata logs cannot be continuously appended to. Instead, each update resulting from data manipulation operations (e.g., new data ingestion) requires the creation of a new metadata file. </p><p>To maintain sequence and facilitate table state reconstruction, these metadata logs can be sequentially named and organised within a base metadata directory.</p><pre><code>/mytable/
&#9;/metadata_logs/
&#9;&#9;000001.log
&#9;&#9;000002.log
&#9;&#9;000003.log
&#9;&#9;000004.log</code></pre><p>In order to rebuild the current table state, the order of the metadata logs in the metadata directory, and possibly the <em>Timestamp</em> field in the logs, can serve as the physical or logical clock which can provide strong sequential ordering semantics for replaying metadata events. </p><p>The query engines can scan the event log sequentially to replay all the metadata state change events in order to rebuild the current snapshot view of the table.</p><p></p><h3>Log Compaction</h3><p>Frequent data updates on large datasets can lead to a proliferation of metadata log files, as each change necessitates a new log entry. </p><p>Over time, the overhead of listing and processing these files during state reconstruction can become a performance bottleneck, negating the benefits of decoupling metadata management.</p><p>To mitigate this, a compaction process can merge individual log files into a consolidated file, removing obsolete records like superseded add and remove events. However, for time travel and rollback capabilities, these outdated events must be retained for a specified period.</p><p>By periodically executing background compaction jobs, we can generate snapshot logs encapsulating all essential state changes up to a specific point in time.</p><pre><code>/mytable/
&#9;/metadata_logs/
&#9;&#9;000001.log
&#9;&#9;000002.log
&#9;&#9;000003.log
&#9;&#9;000004.log
&#9;&#9;snapshot_000004.log
&#9;&#9;000005.log</code></pre><p>In the above example, the snapshot log <code>snapshot_000004.log</code> has been generated for sequential log files <code>000001.log</code> to <code>000004.log</code> containing all the metadata transactions up to that point. To get the current table snapshot view, the latest snapshot file along with any additional new delta log files need to be scanned, which is now more optimised and efficient.</p><div><hr></div><h3>What did we just build?</h3><p>We've successfully designed a foundational log-oriented table format that addresses our initial requirements by using simple, immutable transactional logs to manage table metadata alongside data files. </p><p>This approach serves as the bedrock for modern open table formats like <strong>Apache Hudi</strong>, <strong>Delta Lake</strong>, and <strong>Apache Iceberg</strong>.</p><p>Essentially:</p><div class="pullquote"><p><strong>The modern open table formats provide a mutable table abstraction layer on top of immutable data files through a log-based metadata layer, offering database-like features such as ACID compliance, upserts, table versioning, and auditing.</strong></p></div><p>This architectural shift marks a significant departure from previous table implementations by eliminating heavy reliance on the underlying storage system's metadata API, a potential performance bottleneck in large-scale data lakes. 
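</p><p>Looking back at the exercise, the replay and compaction steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how Apache Hudi, Delta Lake, or Apache Iceberg are implemented: it assumes the pipe-delimited event format used in the exercise, holds each log file in memory as a list of lines, ignores partition events, and uses the hypothetical helper names <code>replay</code> and <code>compact</code>.</p><pre><code>
```python
# Minimal sketch of replaying and compacting the exercise's metadata logs.
# Assumptions: events are "timestamp|object|event type|value" lines, and the
# helper names replay() and compact() are illustrative, not a real library API.


def replay(log_files):
    """Replay log files in order and return the set of live data files."""
    live_files = set()
    for lines in log_files:
        for line in lines:
            timestamp, obj, event, value = line.split("|")
            if obj != "file":
                continue  # partition events are ignored in this sketch
            if event == "add":
                live_files.add(value)
            elif event == "remove":
                live_files.discard(value)
    return live_files


def compact(log_files, snapshot_timestamp):
    """Merge log files into one snapshot holding only the surviving adds."""
    return [
        f"{snapshot_timestamp}|file|add|{path}"
        for path in sorted(replay(log_files))
    ]


# Two log files from the exercise, represented as in-memory lists of lines.
log_1 = [
    "20231015132000|partition|add|/year=2023/month=10/day=15",
    "20231015132010|file|add|/year=2023/month=10/day=15/00001.parquet",
    "20231015132011|file|add|/year=2023/month=10/day=15/00002.parquet",
    "20231015132011|file|add|/year=2023/month=10/day=15/00003.parquet",
]
log_2 = ["20231015132011|file|remove|/year=2023/month=10/day=15/00003.parquet"]

current_state = replay([log_1, log_2])
snapshot = compact([log_1, log_2], "20231015133000")
```
</code></pre><p>Replaying <code>snapshot</code> gives the same live file set as replaying the full history, which is why a reader only needs the latest snapshot plus any log files written after it. Unlike this sketch, a real implementation would keep superseded events for a retention period to support time travel and rollback.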
</p><p>By abstracting the physical file layout and tracking the table state (including partitions) at the file level within the metadata layer, these formats decouple logical and physical data organisation using the log-oriented metadata layer as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4yrN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1233a0f-f83c-4d8c-86ac-7e11d765153e_1026x848.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4yrN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1233a0f-f83c-4d8c-86ac-7e11d765153e_1026x848.png 424w, https://substackcdn.com/image/fetch/$s_!4yrN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1233a0f-f83c-4d8c-86ac-7e11d765153e_1026x848.png 848w, https://substackcdn.com/image/fetch/$s_!4yrN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1233a0f-f83c-4d8c-86ac-7e11d765153e_1026x848.png 1272w, https://substackcdn.com/image/fetch/$s_!4yrN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1233a0f-f83c-4d8c-86ac-7e11d765153e_1026x848.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4yrN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1233a0f-f83c-4d8c-86ac-7e11d765153e_1026x848.png" width="1026" height="848" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1233a0f-f83c-4d8c-86ac-7e11d765153e_1026x848.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:848,&quot;width&quot;:1026,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19793,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4yrN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1233a0f-f83c-4d8c-86ac-7e11d765153e_1026x848.png 424w, https://substackcdn.com/image/fetch/$s_!4yrN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1233a0f-f83c-4d8c-86ac-7e11d765153e_1026x848.png 848w, https://substackcdn.com/image/fetch/$s_!4yrN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1233a0f-f83c-4d8c-86ac-7e11d765153e_1026x848.png 1272w, https://substackcdn.com/image/fetch/$s_!4yrN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1233a0f-f83c-4d8c-86ac-7e11d765153e_1026x848.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Open Table Format Architecture</figcaption></figure></div><p></p><h3>What about query performance?</h3><p>In this architecture, query performance is directly affected by how fast the required metadata can be retrieved and scanned during the query planning phase. </p><p>Reading metadata files with the underlying storage's fast sequential I/O performs much better than calling its metadata APIs to gather the required details, such as listing all sub-directories (partitions) and files, and retrieving column-level statistics either from the footers of the data files or from an external metadata engine. </p><h3>Is this a Novel Design?</h3><p>The concept of using metadata files to track data files and associated metadata isn't entirely novel. 
</p><p>Key-value stores like <strong>RocksDB</strong> and <strong>LevelDB</strong> employ a similar approach, using <em>manifest files</em> to keep track of <strong>SSTables</strong> (data segments in the <a href="https://practicaldataengineering.substack.com/p/internal-storage-design-of-modern">LSM-Tree storage model</a>) and their corresponding key ranges. These manifest files are cached in memory, enabling rapid identification of relevant SSTables without exhaustive directory scans using the underlying storage APIs [1].</p><p>I wonder if those smart engineers behind the modern open table formats drew any inspiration from the metadata management design of storage systems like RocksDB!</p><p></p><h3>Adding Additional Features</h3><p>By adopting an event log and event sourcing model, we can readily implement additional valuable primitives:</p><ul><li><p><strong>Event Replay</strong> - The ability to replay file and directory change event logs up to a specific version.</p></li><li><p><strong>Full State Rebuild</strong> - Compute engines can reconstruct the table's current state and identify active files and partitions by processing the metadata event log.</p></li><li><p><strong>Time Travel</strong> - Similar to event-based systems, we can revert to previous table versions using the event log and versioning mechanism.</p></li><li><p><strong>Event-Based Streaming Support</strong> - The transactional log inherently functions as a message queue, enabling the creation of streaming pipelines without relying on separate message buses.</p></li></ul><p>Recall how Apache Hive manages column-level statistics (e.g., min/max values) for each table partition by storing records in a metadata database to optimise query performance. 
While binary formats like ORC and Parquet include file-level statistics, eliminating the need to query the Metastore, this approach still requires scanning and loading file footers during query planning, which limits scalability.</p><p>To address this, we can leverage our log-based metadata layer to store additional statistical metadata, optimising query performance by avoiding external system interactions and extensive file footer scans. </p><p>By consolidating file-level statistics into a small set of index files, we aim to reduce the I/O overhead of query planning from linear in the number of data files (O(n)) to near-constant (O(1)).</p><p>We could essentially follow the same metadata organisation, but use a different naming convention to manage the column stats index. For each new data file loaded, a new <em><strong>delta index log</strong></em> can be generated to save the column stats records. When a compaction job runs to consolidate the metadata logs, it can also compact the column index logs to generate a snapshot file.</p><pre><code>/mytable/
&#9;/metadata_logs/
&#9;&#9;stats_000001.log
&#9;&#9;stats_000002.log
&#9;&#9;stats_000003.log
&#9;&#9;stats_000004.log
&#9;&#9;stats_snapshot_000004.log
&#9;&#9;stats_000005.log</code></pre><p></p><h3>Using a Consolidated Log File</h3><p>By adopting a more structured file format capable of handling nested structures, such as JSON or Avro, we can optimise our design by consolidating all metadata within a single metadata file. This unified approach simplifies metadata management and reduces I/O overhead compared to managing multiple log files. </p><p>Furthermore, a single schema can be used to encapsulate different metadata types, streamlining the overall structure. To differentiate between metadata sets, we can employ a nested structure with distinct keys, similar to the following:</p><pre><code>files:[
  {timestamp:20231015132000,type:partition,action:add,details:/data=20231015},
  {timestamp:20231015132000,type:file,action:add,details:/data=20231015/00001.parquet}
]

stats:[
  {timestamp:20231015132000,partition:/data=20231015,filename:00001.parquet,column_name:price,column_type:float,min:5,max:20},
  {timestamp:20231015132000,partition:/data=20231015,filename:00001.parquet,column_name:product,column_type:string,min:book,max:pen}
]</code></pre><p>The above design is similar to how <strong>Apache Hudi</strong> manages the <a href="https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331">metadata in HFiles format</a>. Another possible format is to use a single object for each data file that contains all the related metadata as nested entries:</p><pre><code>{
  "timestamp": 20231015132000,
  "type": "file",
  "action": "add",
  "details": "/data=20231015/00001.parquet",
  "column_stats":[
     {"column_name":"price","column_type":"float","min":5,"max":20},
     {"column_name":"product","column_type":"string","min":"book","max":"pen"}
  ]
}</code></pre><p>This is the approach <strong>Delta Lake</strong> uses, storing column-level statistics as a nested structure inside the main JSON transaction logs, under the <a href="https://github.com/delta-io/delta/blob/master/PROTOCOL.md#Per-file-Statistics">stats index</a>. </p><p></p><h3>Adding ACID Guarantees</h3><p>A core design objective of open table formats is to enable ACID guarantees through the metadata layer. The new log-structured metadata approach inherently supports functions such as versioning and <strong>Snapshot Isolation</strong> via <strong>MVCC</strong>, addressing the previously discussed transaction isolation challenges in data lakes.</p><p>To provide Snapshot Isolation, writes can occur in the following two steps:</p><ol><li><p>Optimistically create or replace data files, or delete existing files on the underlying storage.</p></li><li><p>Atomically update the metadata transaction log with the newly added or removed files, generating a new metadata version.</p></li></ol><p>This transactional mechanism ensures data integrity by preventing readers from encountering incomplete or corrupt data, a common issue in the previous generation of table formats. By bypassing file system listing operations, we eliminate consistency issues like <em><strong>list-after-write</strong></em> on some object stores.</p><p>All three major table formats (<strong>Hudi</strong>, <strong>Delta Lake</strong>, <strong>Iceberg</strong>) implement <strong>MVCC</strong> with snapshot isolation to provide read-write isolation and versioning. They maintain multiple table versions as data changes, allowing readers to select files from the most recent consistent snapshot using the transaction log. </p><p><strong>Multi-Write Concurrency</strong> can be facilitated through <strong>Optimistic Concurrency Control (OCC)</strong>, which validates transactions before committing to detect potential conflicts. 
If concurrent writes target non-overlapping file sets, they can proceed independently. </p><p>However, if there's overlap, only one write succeeds while the others are aborted during conflict resolution. All three major table formats employ some form of Optimistic Concurrency Control to manage concurrent writes and identify conflicts effectively.</p><p></p><h2>The Origin of Modern Open Table Format Implementations</h2><p>As previously discussed, the current generation of open table formats emerged to address the limitations of the previous generation of data management approaches on data lakes, and the foundation of these tools lies in the log-structured metadata organisation explored earlier.</p><ul><li><p><strong>Apache Hudi</strong>, <a href="https://www.uber.com/en-AU/blog/uber-big-data-platform/">initiated by Uber in 2016</a>, primarily aimed to enable scalable, incremental upserts and streaming ingestion into data lakes, while providing ACID guarantees on HDFS. Its design is heavily optimised for handling mutable data streams. The traditional snapshot and batch ingestion patterns used with Hive-style tables proved inadequate for low-latency use cases. An incremental approach that focused on new and updated data was necessary, but the immutability of HDFS posed challenges.</p></li><li><p><strong>Apache Iceberg</strong> originated at <strong>Netflix</strong> around 2017 in response to the scalability and transactional limitations of Hive's schema-centric, directory-oriented table format. The realisation that incremental improvements to Hive were insufficient drove the development of a new solution that changes the table design to track data at the file level, by pointing the table to an ordered list of files. 
Iceberg was born from this insight and employs a manifest-based metadata layer consisting of metadata, manifest list, and manifest files organised hierarchically.</p></li></ul><ul><li><p><strong>Delta Lake</strong>, introduced by <strong>Databricks</strong> in 2017 and open-sourced in 2019, emerged as the third major open table format. Its primary goal was to provide ACID transaction capabilities atop cloud object store-based data lakes. This was motivated by the absence of ACID guarantees, including cross-object consistency and query isolation, within cloud object stores. </p></li><li><p><strong>Apache Paimon</strong> is another notable and fairly recent open table format. The Apache Flink community developed it in 2022 as the "<strong>Flink Table Store</strong>", a streaming storage layer for the lakehouse designed primarily for high-throughput, low-latency streaming data ingestion. However, it has yet to gain significant traction compared to the dominant trio.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2oqX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5429ef-96c2-4309-acce-a5a887b559b4_1740x954.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2oqX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5429ef-96c2-4309-acce-a5a887b559b4_1740x954.png 424w, https://substackcdn.com/image/fetch/$s_!2oqX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5429ef-96c2-4309-acce-a5a887b559b4_1740x954.png 848w, 
https://substackcdn.com/image/fetch/$s_!2oqX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5429ef-96c2-4309-acce-a5a887b559b4_1740x954.png 1272w, https://substackcdn.com/image/fetch/$s_!2oqX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5429ef-96c2-4309-acce-a5a887b559b4_1740x954.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2oqX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5429ef-96c2-4309-acce-a5a887b559b4_1740x954.png" width="1456" height="798" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c5429ef-96c2-4309-acce-a5a887b559b4_1740x954.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:798,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84811,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2oqX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5429ef-96c2-4309-acce-a5a887b559b4_1740x954.png 424w, https://substackcdn.com/image/fetch/$s_!2oqX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5429ef-96c2-4309-acce-a5a887b559b4_1740x954.png 848w, 
https://substackcdn.com/image/fetch/$s_!2oqX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5429ef-96c2-4309-acce-a5a887b559b4_1740x954.png 1272w, https://substackcdn.com/image/fetch/$s_!2oqX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5429ef-96c2-4309-acce-a5a887b559b4_1740x954.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These projects have significantly streamlined data management for users by automating optimisations, compaction, and indexing 
processes. This relieves data engineers of the burden of complex low-level physical data management tasks. </p><p>In <a href="https://practicaldataengineering.substack.com/p/the-history-and-evolution-of-open">Part 1</a> we questioned whether we could build a system that can combine the benefits of the traditional monolithic DBMS and disaggregated data lake systems. We can now declare that what we have is:</p><blockquote><p>A powerful combination of the <strong>encapsulation</strong> and <strong>abstraction</strong> found in traditional DBMS physical layers with the <strong>openness</strong>, <strong>interoperability</strong>, and <strong>flexibility</strong> of modern open table formats.</p></blockquote><p></p><h2>Industry Adoption</h2><p>The past few years have witnessed widespread adoption and integration of next-generation open table formats across various data tools and platforms. </p><p>All the major open table formats have gained <a href="https://medium.com/@kywe665/delta-hudi-iceberg-which-is-most-popular-29ca56767199">traction and popularity</a>, while fierce competition for market dominance has been going on, mainly among the SaaS vendors offering these products as managed services. </p><p>Major cloud providers have also embraced one or more of the big three formats, with <strong>Microsoft</strong> fully committing to Delta Lake for its latest <strong>OneLake</strong> and <strong>Microsoft Fabric</strong> analytics platforms, and <strong>Google</strong> adopting Iceberg as the primary table format for its <strong>BigLake</strong> platform. <strong>Cloudera</strong>, a leading Hadoop vendor, has also built its open data lakehouse solution around Apache Iceberg.</p><p>Prominent open source compute engines like <strong>Presto</strong>, <strong>Trino</strong>, <strong>Flink</strong>, and <strong>Spark</strong> now support reading and writing to these open table formats. 
Additionally, major MPP and cloud data warehouse vendors, including Snowflake, BigQuery, and Redshift, have incorporated support through external table features. </p><p>Beyond these tools and platforms, numerous companies have publicly documented their migration to open table formats.</p><p></p><h1>3rd Generation OTF - Unified Open Table Format</h1><p>The evolution of open table formats has marched on with a new trend since last year: <strong>cross-table interoperability</strong>. </p><p>This exciting development aims to create a <strong>unified and universal open table format</strong> that seamlessly works with all major existing formats under the hood.</p><p>Currently, converting between formats requires metadata translation and data file copying. However, since these formats share a foundation and often use Parquet as the default serialisation format, significant opportunities for interoperability exist.</p><p>A <strong>uniform metadata layer</strong> promises a unified approach for reading and writing data across all major open table formats. 
Different readers and writers would leverage this layer to interact with the desired format, eliminating the need for manual format-specific metadata conversion or data file duplication.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A37a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05899b93-40dc-4089-8ed9-a59fd89379ac_1350x857.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A37a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05899b93-40dc-4089-8ed9-a59fd89379ac_1350x857.png 424w, https://substackcdn.com/image/fetch/$s_!A37a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05899b93-40dc-4089-8ed9-a59fd89379ac_1350x857.png 848w, https://substackcdn.com/image/fetch/$s_!A37a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05899b93-40dc-4089-8ed9-a59fd89379ac_1350x857.png 1272w, https://substackcdn.com/image/fetch/$s_!A37a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05899b93-40dc-4089-8ed9-a59fd89379ac_1350x857.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A37a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05899b93-40dc-4089-8ed9-a59fd89379ac_1350x857.png" width="1350" height="857" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05899b93-40dc-4089-8ed9-a59fd89379ac_1350x857.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:857,&quot;width&quot;:1350,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:157517,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!A37a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05899b93-40dc-4089-8ed9-a59fd89379ac_1350x857.png 424w, https://substackcdn.com/image/fetch/$s_!A37a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05899b93-40dc-4089-8ed9-a59fd89379ac_1350x857.png 848w, https://substackcdn.com/image/fetch/$s_!A37a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05899b93-40dc-4089-8ed9-a59fd89379ac_1350x857.png 1272w, https://substackcdn.com/image/fetch/$s_!A37a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05899b93-40dc-4089-8ed9-a59fd89379ac_1350x857.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>The State of the Art</h3><p><strong>LinkedIn</strong> engineers pioneered one of the earliest attempts at a unified table API with <strong><a href="https://www.linkedin.com/blog/engineering/data-management/taking-charge-of-tables--introducing-openhouse-for-big-data-mana">OpenHouse</a></strong>, <a href="https://www.linkedin.com/blog/engineering/data-management/taking-charge-of-tables--introducing-openhouse-for-big-data-mana">introduced in 2022</a>. Built on top of Apache Iceberg, <strong>OpenHouse</strong> offered a simplified interface for interacting with tables, regardless of their underlying format, through a RESTful Table Service seamlessly integrated with Spark.</p><p>While <strong>OpenHouse</strong> was a great effort, it lacked comprehensive interoperability and format conversion capabilities. 
Additionally, its <a href="https://www.linkedin.com/blog/engineering/open-source/open-sourcing-openhouse">open-sourcing in 2024</a> came relatively late compared to other emerging projects that had already gained significant traction, especially with giant tech companies such as Databricks, <a href="https://venturebeat.com/data-infrastructure/exclusive-microsoft-and-google-join-forces-on-onetable-an-open-source-solution-for-data-lake-challenges/">Microsoft and Google</a> backing the following projects.</p><p><strong><a href="https://xtable.apache.org/">Apache XTable</a></strong> (formerly known as <strong>OneTable</strong>), introduced by <strong>Onehouse</strong> in 2023, provides a lightweight abstraction layer for generating metadata for any supported format using common models for schemas, partitioning details, and column statistics. In terms of metadata layout, <strong>XTable</strong> stores metadata for each format side-by-side within the metadata layer.</p><p>XTable uses the latest snapshot of the primary table format, and generates additional metadata for target formats. Consumers can use either the primary format or the target formats to read and write, and get the same consistent view of the table&#8217;s data.</p><p><strong>Databricks</strong> introduced <strong><a href="https://www.databricks.com/blog/delta-uniform-universal-format-lakehouse-interoperability">Delta UniForm</a></strong> in 2023. <strong>Delta UniForm</strong> automatically generates metadata for Delta Lake and Iceberg tables while maintaining a single copy of shared Parquet data files. It's important to note that <strong>UniForm</strong>, primarily sponsored by Databricks, seems focused on using Delta Lake as the primary format while enabling external applications and query engines to read other formats.</p><h3>How do they compare?</h3><p>LinkedIn's OpenHouse project offers more of a control plane than a unified table format layer. 
</p><p>Comparing Apache XTable to Delta UniForm, XTable takes a broader approach, aiming for full interoperability and allowing users to mix and match read/write features from different formats regardless of the primary format chosen. </p><p>As an example, XTable could enable incremental data ingestion into a Hudi table (leveraging its efficiency) while allowing data to be read using Iceberg format by query engines like Trino, Snowflake, or BigQuery.</p><p>That being said, we're still in the early phases of development of uniform table format APIs. It will be exciting to see how they progress over the coming months.</p><h1>Data Lakehouse</h1><p>That brings us to the last part of this blog post, which explores the concept of a data lakehouse, without which our discussion would be incomplete. Let&#8217;s define what a data lakehouse stands for:</p><blockquote><p><strong>A data lakehouse represents a unified, next-generation data architecture that combines the cost-effectiveness, scalability, flexibility and openness of data lakes, with the performance, transactional guarantees and governance features typically associated with data warehouses.</strong> </p></blockquote><p>That definition sounds very similar to what open table formats stand for! That&#8217;s because the lakehouse foundation is based on leveraging open table formats for implementing ACID, auditing, versioning, and indexing directly on low-cost cloud storage, to bridge the gap between these two traditionally distinct data management paradigms.</p><p>In essence, data lakehouses enable organisations to treat data lake storage as if it were a traditional data warehouse and vice versa. 
They offer the flexibility and decoupled architecture of data lakes (allowing for the storage of unstructured and semi-structured data in open formats and the use of diverse compute engines), combined with the performance, transactional capabilities, and full CRUD operations characteristic of data warehouses.</p><p>This vision was initially pursued by <strong>SQL-on-Hadoop</strong> tools to bring data warehousing to Hadoop platforms, but it is only now being fully realised, thanks to recent advancements in the data landscape.</p><h2>Non-Open vs Open Data Lakehouse</h2><p>It's crucial to differentiate between a general "<strong>data lakehouse</strong>" and an "<strong>open data lakehouse</strong>". </p><p>While top cloud vendors like AWS and Google often label their data warehouse-centric platforms as data lakehouses, their definition is broader. </p><p>Their emphasis is on their data warehouses' ability to store semi-structured data, support external workloads like Spark, enable ML model training, and query open data files, all characteristics traditionally associated with data lakes. 
These platforms also typically feature decoupled storage and compute architectures.</p><p>Around 2020-2021, Amazon started promoting a <a href="https://aws.amazon.com/blogs/big-data/build-a-lake-house-architecture-on-aws/">lake house concept</a> as a new approach to data warehousing, comprising the <strong>Amazon Redshift</strong> data warehouse implemented over the new RA3 managed storage, plus <strong>Redshift Spectrum</strong>. This was before the wider adoption of the "<em>lakehouse</em>" term by other vendors like <strong>Databricks</strong> and the data community in general.</p><p><strong>Google</strong> has similarly promoted its analytics lakehouse architecture, outlined in a <a href="https://cloud.google.com/blog/products/data-analytics/understanding-a-google-cloud-analytics-lakehouse">whitepaper</a> published in 2023, providing a blueprint for building a <strong>unified analytics lakehouse</strong> using either BigQuery as the first choice, or open Apache Iceberg and the BigLake platform.</p><p>On the other hand, <strong>open data lakehouses</strong> primarily leverage open table formats to manage data on low-cost data lake storage. This architecture promotes higher interoperability and flexibility, allowing organisations to select the optimal compute and processing engine for each job or workload. </p><p>By eliminating the need to duplicate and move data across systems, open data lakehouses ensure that all data remains in its original, open format, serving as a <em>single source of truth</em>. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BZ0R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f503e10-da2e-45b9-a3fc-c6e4717574be_1290x984.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BZ0R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f503e10-da2e-45b9-a3fc-c6e4717574be_1290x984.png 424w, https://substackcdn.com/image/fetch/$s_!BZ0R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f503e10-da2e-45b9-a3fc-c6e4717574be_1290x984.png 848w, https://substackcdn.com/image/fetch/$s_!BZ0R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f503e10-da2e-45b9-a3fc-c6e4717574be_1290x984.png 1272w, https://substackcdn.com/image/fetch/$s_!BZ0R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f503e10-da2e-45b9-a3fc-c6e4717574be_1290x984.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BZ0R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f503e10-da2e-45b9-a3fc-c6e4717574be_1290x984.png" width="1290" height="984" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f503e10-da2e-45b9-a3fc-c6e4717574be_1290x984.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:984,&quot;width&quot;:1290,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:189170,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BZ0R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f503e10-da2e-45b9-a3fc-c6e4717574be_1290x984.png 424w, https://substackcdn.com/image/fetch/$s_!BZ0R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f503e10-da2e-45b9-a3fc-c6e4717574be_1290x984.png 848w, https://substackcdn.com/image/fetch/$s_!BZ0R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f503e10-da2e-45b9-a3fc-c6e4717574be_1290x984.png 1272w, https://substackcdn.com/image/fetch/$s_!BZ0R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f503e10-da2e-45b9-a3fc-c6e4717574be_1290x984.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Open Data Lakehouse Architecture</figcaption></figure></div><p>Vendors such as Databricks, Microsoft OneLake, OneHouse, Dremio, and Cloudera have positioned themselves as providers of managed open data lakehouse platforms in the cloud.</p><h1>Conclusion</h1><p>This post series has covered a lot of ground, taking you on a journey through the evolution of data. </p><p>I am personally always interested in understanding how a technology came to be, the major architectural changes and evolutions it underwent, and the design goals and motivations behind it. </p><p>I hope you have enjoyed the ride and now have a better understanding of where we are in the technology timeline and how we got here.</p><p></p><h1>References</h1><p>[1] Dong, S., Callaghan, M., Galanis, L., Borthakur, D., Savor, T., &amp; Strum, M. (2017, January). Optimizing Space Amplification in RocksDB. In <em>CIDR</em> (Vol. 3, p. 
3).</p>]]></content:encoded></item><item><title><![CDATA[The History and Evolution of Open Table Formats - Part I]]></title><description><![CDATA[From Hive to High Performance: A Journey Through the Evolution of Data Management on Data Lakes]]></description><link>https://www.pracdata.io/p/the-history-and-evolution-of-open</link><guid isPermaLink="false">https://www.pracdata.io/p/the-history-and-evolution-of-open</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Thu, 15 Aug 2024 06:27:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0mek!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0mek!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0mek!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png 424w, https://substackcdn.com/image/fetch/$s_!0mek!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png 848w, https://substackcdn.com/image/fetch/$s_!0mek!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0mek!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0mek!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png" width="1456" height="562" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:307891,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0mek!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png 424w, https://substackcdn.com/image/fetch/$s_!0mek!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png 848w, https://substackcdn.com/image/fetch/$s_!0mek!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0mek!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6df566-8943-45d9-8eff-7ef40a6af7a6_4247x1639.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>If you have been following trends in the data engineering landscape over the past few years, you have surely been hearing a lot about <em><strong>Open Table Formats</strong></em> and <em><strong>Data Lakehouse</strong></em>, if not already working with them! 
But what is all the hype about table formats, if they have always existed and we have always worked with tables when dealing with structured data in any application? </p><p>In this blog post, we will delve into the history and evolution of open table formats within the data landscape. We will explore the challenges that led to their inception, the key innovations that have defined them, and the impact they have had on the industry. </p><p>By understanding the journey from traditional database management systems to the modern open table formats, we can better appreciate the current state of data technology and anticipate future trends.</p><p>In <strong>Part I</strong>, we will discuss the origin and history of storing and managing data in tabular format, and the emergence of the first generation of open table formats.</p><p>In <strong><a href="https://open.substack.com/pub/practicaldataengineering/p/the-history-and-evolution-of-open-14d?r=23jwn&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part II</a></strong>, the second and third generations of open table formats will be discussed.</p><h1>The origin of Table Formats</h1><p>Presenting information in a two-dimensional tabular format has been the most fundamental and universal method for displaying structured data, with roots tracing back over 3500 years to the Old Babylonian period, when the <a href="https://www.datafix.com.au/BASHing/2020-08-12.html">most ancient tabular data</a> were recorded on clay tablets.</p><p>The modern concept of database tables emerged with the invention of relational databases, inspired by <a href="https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf">E.F. Codd's paper</a> on the <strong>Relational Model</strong> published in 1970. </p><p>Since then, table formats have been the primary abstraction for managing and working with structured data in relational database management systems, such as the pioneering System R. 
Thus, the concept of table formats in storage systems is not novel, having been a staple for the past half-century.</p><h2>Table Format Abstraction</h2><p>Data tables are logical datasets, an abstraction layer over physical data files stored on disk, providing a unified, two-dimensional tabular view of records. The storage engine combines records from various objects for a dataset and presents them as one or more logical tables to the end user. </p><p>This <strong>logical table</strong> presentation offers the advantage of decoupling and hiding the physical characteristics of data from applications and users, allowing for the evolution, optimisation, and modification of physical implementation details without impacting users.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wwmz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa95bf8da-0373-44a4-a2b0-e79db84cb644_1069x609.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wwmz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa95bf8da-0373-44a4-a2b0-e79db84cb644_1069x609.png 424w, https://substackcdn.com/image/fetch/$s_!wwmz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa95bf8da-0373-44a4-a2b0-e79db84cb644_1069x609.png 848w, https://substackcdn.com/image/fetch/$s_!wwmz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa95bf8da-0373-44a4-a2b0-e79db84cb644_1069x609.png 1272w, 
https://substackcdn.com/image/fetch/$s_!wwmz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa95bf8da-0373-44a4-a2b0-e79db84cb644_1069x609.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wwmz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa95bf8da-0373-44a4-a2b0-e79db84cb644_1069x609.png" width="1069" height="609" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a95bf8da-0373-44a4-a2b0-e79db84cb644_1069x609.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:609,&quot;width&quot;:1069,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19298,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wwmz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa95bf8da-0373-44a4-a2b0-e79db84cb644_1069x609.png 424w, https://substackcdn.com/image/fetch/$s_!wwmz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa95bf8da-0373-44a4-a2b0-e79db84cb644_1069x609.png 848w, https://substackcdn.com/image/fetch/$s_!wwmz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa95bf8da-0373-44a4-a2b0-e79db84cb644_1069x609.png 1272w, 
https://substackcdn.com/image/fetch/$s_!wwmz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa95bf8da-0373-44a4-a2b0-e79db84cb644_1069x609.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>So, we've been hearing a ton about open table formats lately, but <strong>what's the big deal?</strong> And what's the difference between open and non-open or closed formats anyway? 
To figure that out, let's dive into how a general database management system is implemented.</p><h2>Relational Table Format</h2><p>Prior to the Big Data era and the emergence of Apache Hadoop in the mid-2000s, traditional <strong>Database Management Systems (DBMS)</strong> adhered to a <strong>monolithic</strong> <strong>architectural</strong> design. </p><p>This architecture comprised several <em><strong>highly</strong></em><strong> </strong><em><strong>interconnected and tightly coupled layers</strong></em>, each dedicated to specific functionality essential for the database's operation, with all the components combined to form a single unified system. The storage layer, in particular, managed the physical aspects of data persistence.</p><p>At the core of this structure lay the <strong>Storage Engine</strong>. This component served as the lowest abstraction level, overseeing the physical organisation and management of data on disk. Critical tasks such as transaction management, concurrency control, index management, and recovery were also handled by the storage engine.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rzkQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e013b55-49bb-46a5-87a4-4ea821c65dfe_821x686.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rzkQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e013b55-49bb-46a5-87a4-4ea821c65dfe_821x686.png 424w, https://substackcdn.com/image/fetch/$s_!rzkQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e013b55-49bb-46a5-87a4-4ea821c65dfe_821x686.png 848w, 
https://substackcdn.com/image/fetch/$s_!rzkQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e013b55-49bb-46a5-87a4-4ea821c65dfe_821x686.png 1272w, https://substackcdn.com/image/fetch/$s_!rzkQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e013b55-49bb-46a5-87a4-4ea821c65dfe_821x686.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rzkQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e013b55-49bb-46a5-87a4-4ea821c65dfe_821x686.png" width="452" height="377.67600487210717" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e013b55-49bb-46a5-87a4-4ea821c65dfe_821x686.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:821,&quot;resizeWidth&quot;:452,&quot;bytes&quot;:12558,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rzkQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e013b55-49bb-46a5-87a4-4ea821c65dfe_821x686.png 424w, https://substackcdn.com/image/fetch/$s_!rzkQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e013b55-49bb-46a5-87a4-4ea821c65dfe_821x686.png 848w, 
https://substackcdn.com/image/fetch/$s_!rzkQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e013b55-49bb-46a5-87a4-4ea821c65dfe_821x686.png 1272w, https://substackcdn.com/image/fetch/$s_!rzkQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e013b55-49bb-46a5-87a4-4ea821c65dfe_821x686.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">High level DBMS architecture</figcaption></figure></div><p>Crucially, the query and presentation layers operated in 
isolation from the intricacies of the storage layer. Data traversed these layers during read and write operations, with each layer imposing its own abstraction. This encapsulation meant that the physical layout and storage format of data remained concealed from external systems.</p><p></p><h4><strong>What&#8217;s the implication?</strong></h4><p>This means that it was not possible to directly access or manipulate a database's physical data files using other systems or programming languages like Python, as we do nowadays. </p><p>Moreover, such a system lacked <strong>interoperability</strong>: one could not just copy the database files to another system, or simply point a generic query engine at the database's files on the OS, and interact with the data.</p><p>Given these constraints, the concept of an <strong>Open Table Format (OTF)</strong> as we understand it today was non-existent. Traditional databases employed proprietary storage formats tightly integrated with their specific implementations.</p><p></p><h4><strong>Are we saying there is a fundamental flaw in traditional DBMS architecture!?</strong></h4><p>Many would argue that this is actually not a bad design at all. After all, normal users shouldn't need to know, care about, or change how the underlying physical layer is implemented, because they could do catastrophic things! Additionally, considerable technical expertise is applied to integrate interconnected components into a cohesive complex system.</p><p>Considering software design best practices, it&#8217;s a valid argument, as the design fully encapsulates the complexities of managing the underlying data. However, the tradeoff is that this closed and tightly coupled design hinders <strong>interoperability</strong>, <strong>portability</strong> and <strong>open collaboration</strong> to build scalable and innovative systems based on open standards. </p><p>But is it possible to have the best of both worlds? 
That is, can we have highly interoperable, portable, open and scalable data systems and still encapsulate the complexities of low-level tasks such as managing data files on disk? We will find out later in the article.</p><p></p><h2>Hadoop and Big Data Revolution</h2><p>Let's fast forward from the 1970s to 2006, when the <strong>BIG DATA Revolution</strong> took place. The data landscape underwent a seismic shift when the Apache Hadoop project was born out of Yahoo, leading to the <a href="https://materializedview.io/p/databases-are-falling-apart">disassembly of database systems</a>.</p><p>I will not discuss the internals of Apache Hadoop and its architecture, as there is plenty of material available if you are unfamiliar with it. But one major architectural breakthrough was the <strong>decoupling of storage and compute</strong>. </p><p>This fundamental architectural change allowed for the storage of vast amounts of data in common semi-structured text-based formats such as CSV and JSON, or binary formats like <strong>Avro</strong>, <strong>Parquet</strong>, and <strong>ORC</strong>, on the <strong>HDFS</strong> distributed file system deployed on affordable commodity hardware. </p><p>Data could be stored much like files on a local file system and processed using distributed processing frameworks of choice like <strong>MapReduce</strong>, <strong>Pig</strong>, <strong>Hive</strong>, <strong>Impala</strong> and <strong>Presto</strong>.</p><p>For the first time, businesses could store vast amounts of data in their preferred open formats, and leverage different compute engines for various workloads, enabling large-scale analytics. 
This was a game-changer for those accustomed to inflexible, expensive, monolithic storage systems and proprietary data warehouses.</p><p>But the real breakthrough, as <a href="https://medium.com/s-c-a-l-e/database-guru-on-why-nosql-matters-and-sql-still-matters-c64239fe84fd">stated by AMPLab co-director Michael Franklin</a>, was achieving <em><strong>data independence</strong></em> as a result of the new decoupled architecture: </p><div class="pullquote"><p><em><strong>The real breakthrough was the separation of the logical view you have of the data and how you want to work with it, from the physical reality of how the data is actually stored.</strong></em></p></div><p>That is why <em><strong>Big Data</strong></em> was such a <em><strong>Big Hype</strong></em> at the time, generating great excitement, with enterprises rushing to bring the elephant into the room &#8211; a similar level of <strong><a href="https://gradientflow.substack.com/p/learning-from-the-past-comparing">hype surrounds Generative AI</a></strong> today, creating a sense of d&#233;j&#224; vu for some. Nevertheless, big data was a true revolution, breaking free from the confines of traditional systems and providing the foundation for many of the innovations that followed in the <strong><a href="https://practicaldataengineering.substack.com/p/open-source-data-engineering-landscape">open data ecosystem</a></strong>.</p><p></p><h1>1st Generation OTF - The Birth of Open Table Format</h1><p>The initial release of Apache Hadoop presented significant challenges for data engineers. </p><p>Expressing data analysis and processing workloads in <strong>MapReduce</strong> logic using Java was both complex and time-consuming. Moreover, Hadoop lacked a mechanism for storing and managing schemas for datasets on its file system. </p><p>While engineers appreciated Hadoop's flexibility, they yearned for the familiarity of SQL and the two-dimensional table format inherent to relational databases.</p><p>To bridge this gap, Facebook (now Meta), an early and influential Hadoop adopter, initiated the <strong>Hive</strong> project. The goal was to introduce SQL and tabular structures, familiar from traditional relational databases, into the Hadoop and HDFS ecosystem. 
</p><p>However, a key distinction was its new architectural approach: </p><blockquote><p><strong>It was built on top of a decoupled physical layer, leveraging open data formats stored on the HDFS distributed file system.</strong> </p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!abLm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2695d2-19e2-4e21-b55d-7ebb95259f2e_1063x868.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!abLm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2695d2-19e2-4e21-b55d-7ebb95259f2e_1063x868.png 424w, https://substackcdn.com/image/fetch/$s_!abLm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2695d2-19e2-4e21-b55d-7ebb95259f2e_1063x868.png 848w, https://substackcdn.com/image/fetch/$s_!abLm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2695d2-19e2-4e21-b55d-7ebb95259f2e_1063x868.png 1272w, https://substackcdn.com/image/fetch/$s_!abLm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2695d2-19e2-4e21-b55d-7ebb95259f2e_1063x868.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!abLm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2695d2-19e2-4e21-b55d-7ebb95259f2e_1063x868.png" width="1063" height="868" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af2695d2-19e2-4e21-b55d-7ebb95259f2e_1063x868.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:868,&quot;width&quot;:1063,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82221,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!abLm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2695d2-19e2-4e21-b55d-7ebb95259f2e_1063x868.png 424w, https://substackcdn.com/image/fetch/$s_!abLm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2695d2-19e2-4e21-b55d-7ebb95259f2e_1063x868.png 848w, https://substackcdn.com/image/fetch/$s_!abLm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2695d2-19e2-4e21-b55d-7ebb95259f2e_1063x868.png 1272w, https://substackcdn.com/image/fetch/$s_!abLm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf2695d2-19e2-4e21-b55d-7ebb95259f2e_1063x868.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Open Table Architecture on Hadoop</figcaption></figure></div><p></p><h3>Impact of Apache Hive</h3><p>Facebook open-sourced Hive in 2008, making it available to the broader community. A few years later, <strong>Cloudera</strong>, a prominent Hadoop vendor, developed <strong>Apache Impala</strong>. </p><p>Similar to Hive, Impala offered table management on HDFS, incorporating schema management and features like automatic file format conversion and compaction.</p><p>With the introduction of Apache Hive and Impala into the Hadoop stack, the concept of open table formats built upon open file formats was born. Managed and external tables, along with directory-based partitioning, became the primary abstractions for data ingestion, data modeling, and management within the Hadoop ecosystem. 
</p><p>This new data architecture enabled data integration and processing pipelines to operate independently, loading data files into HDFS in the appropriate format without needing to know how, or by which query engine, the data would be consumed.</p><p></p><h2>Evolution of Columnar Binary File Formats</h2><p>Another pivotal advancement was the development of efficient columnar open file formats. This began with <strong>RCFile</strong>, a first-generation <strong>columnar binary serialisation framework</strong> from the Apache Hive project. </p><p>Subsequent innovations included <strong>Apache ORC</strong> as an improved version of RCFile, released in 2013, and <strong>Apache Parquet</strong>, a joint effort between Twitter and Cloudera, also released in 2013. </p><p>These new open file formats dramatically enhanced the performance of OLAP-based analytical workloads on Hadoop, laying the groundwork for building OLAP storage engines directly on data lakes.</p><p>Since then, ORC and Parquet have become the de facto standard open file formats for managing data at rest on data lakes, with Parquet enjoying wider adoption and support in the ecosystem.</p><p>Next, we will dive deeper into how the Hive table format is structured. Before that, let's generalise the physical design that engines such as Hive and Impala use, which relies heavily on the file system directory hierarchy. Let's call it <em><strong>directory-oriented table formats.</strong></em></p><p></p><h2>Directory-oriented Table Formats</h2><p>The most fundamental approach to treating data as a table in a distributed file system such as HDFS (i.e., a data lake) involves projecting a table onto a directory containing immutable data files and potentially sub-directories for partitioning. </p><div class="pullquote"><p>The core principle is to organise data files in a <strong>directory tree</strong>. 
In essence, a table is just a <em>collection of files</em> tracked at the directory level, accessible by various tools and compute engines.</p></div><p>The important factor to note is that this architecture is inherently <strong>tied to the physical file system layout</strong>, relying on file and directory operations for data management. This has been the standard practice for storing data in data lakes since the inception of Hadoop. </p><p><strong>Directory-based partitioning</strong> allows for organising files based on attributes like event or process date. Schema information can be embedded within data files or managed externally by a schema registry. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lsBs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50bc666-72bc-47f1-ad4f-34dcedf1aa45_742x758.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lsBs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50bc666-72bc-47f1-ad4f-34dcedf1aa45_742x758.png 424w, https://substackcdn.com/image/fetch/$s_!lsBs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50bc666-72bc-47f1-ad4f-34dcedf1aa45_742x758.png 848w, https://substackcdn.com/image/fetch/$s_!lsBs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50bc666-72bc-47f1-ad4f-34dcedf1aa45_742x758.png 1272w, https://substackcdn.com/image/fetch/$s_!lsBs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50bc666-72bc-47f1-ad4f-34dcedf1aa45_742x758.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lsBs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50bc666-72bc-47f1-ad4f-34dcedf1aa45_742x758.png" width="614" height="627.2398921832884" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a50bc666-72bc-47f1-ad4f-34dcedf1aa45_742x758.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:758,&quot;width&quot;:742,&quot;resizeWidth&quot;:614,&quot;bytes&quot;:10012,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lsBs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50bc666-72bc-47f1-ad4f-34dcedf1aa45_742x758.png 424w, https://substackcdn.com/image/fetch/$s_!lsBs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50bc666-72bc-47f1-ad4f-34dcedf1aa45_742x758.png 848w, https://substackcdn.com/image/fetch/$s_!lsBs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50bc666-72bc-47f1-ad4f-34dcedf1aa45_742x758.png 1272w, https://substackcdn.com/image/fetch/$s_!lsBs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50bc666-72bc-47f1-ad4f-34dcedf1aa45_742x758.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Directory-oriented table format</figcaption></figure></div><p></p><p>Since table partitions are represented as sub-directories, it becomes the responsibility of the query engines to parse and scan each partition represented as a sub-directory in order to identify the relevant data files during query planning phase. 
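</p><p>As a rough illustration of that planning step, the Python sketch below treats a table as nothing more than a directory tree and walks it to collect data files. The helper names <code>list_table_files</code> and <code>build_demo_table</code>, the <code>.parquet</code> suffix, and the local file system standing in for HDFS are all assumptions made for illustration, not how any particular engine is implemented.</p>
<pre>
```python
import os
import tempfile


def list_table_files(table_root, suffix=".parquet"):
    """Walk every sub-directory under table_root and return data file paths.

    Hypothetical helper: a local directory stands in for HDFS here.
    """
    data_files = []
    for current_dir, _subdirs, file_names in os.walk(table_root):
        for file_name in file_names:
            if file_name.endswith(suffix):
                full_path = os.path.join(current_dir, file_name)
                # Keep paths relative to the table root for readability.
                data_files.append(os.path.relpath(full_path, table_root))
    return sorted(data_files)


def build_demo_table(root):
    """Create a tiny partitioned table layout with empty placeholder files."""
    for partition in ("2024-01-01", "2024-01-02"):
        partition_dir = os.path.join(root, "events", partition)
        os.makedirs(partition_dir)
        for part in ("part-0.parquet", "part-1.parquet"):
            open(os.path.join(partition_dir, part), "w").close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        build_demo_table(root)
        for path in list_table_files(os.path.join(root, "events")):
            print(path)
```
</pre>
<p>Every extra partition directory adds another listing call to a walk like this, which is why planning cost grows with the number of partitions rather than with the amount of data a query actually needs.</p><p>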
</p><p>This implies that the physical partitioning is tightly coupled with the logical partitioning on the table level with its own constraints which will be discussed later.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1hwE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a814187-46e8-477b-a046-883c98381f37_1140x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1hwE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a814187-46e8-477b-a046-883c98381f37_1140x700.png 424w, https://substackcdn.com/image/fetch/$s_!1hwE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a814187-46e8-477b-a046-883c98381f37_1140x700.png 848w, https://substackcdn.com/image/fetch/$s_!1hwE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a814187-46e8-477b-a046-883c98381f37_1140x700.png 1272w, https://substackcdn.com/image/fetch/$s_!1hwE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a814187-46e8-477b-a046-883c98381f37_1140x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1hwE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a814187-46e8-477b-a046-883c98381f37_1140x700.png" width="1140" height="700" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a814187-46e8-477b-a046-883c98381f37_1140x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:700,&quot;width&quot;:1140,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91145,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1hwE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a814187-46e8-477b-a046-883c98381f37_1140x700.png 424w, https://substackcdn.com/image/fetch/$s_!1hwE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a814187-46e8-477b-a046-883c98381f37_1140x700.png 848w, https://substackcdn.com/image/fetch/$s_!1hwE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a814187-46e8-477b-a046-883c98381f37_1140x700.png 1272w, https://substackcdn.com/image/fetch/$s_!1hwE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a814187-46e8-477b-a046-883c98381f37_1140x700.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now that we have covered what a directory-based table looks like, let's look at the Hive table format.</p><p></p><h3>Hive Table Format</h3><p>With the presented storage model, it&#8217;s fair to say that Apache Hive is a directory-oriented table format, relying on the underlying file system's API for mapping files to tables and partitions. Consequently, Hive is heavily influenced by the physical layout of data within the distributed file system.</p><p>Hive employs its own partitioning scheme, using field names and values to create partition directories. It manages schema, partition, and other metadata in a relational database known as the <strong>Metastore</strong>. 
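</p><p>To make the naming convention concrete, here is a minimal Python sketch of how <code>field=value</code> partition directories can be built and parsed back. The helpers <code>partition_path</code> and <code>parse_partition_path</code> and the example table location are hypothetical names chosen for illustration; Hive itself also escapes special characters in partition values, which this sketch ignores.</p>
<pre>
```python
def partition_path(table_location, partition_values):
    """Build a Hive-style directory path such as .../year=2024/month=01."""
    segments = [f"{key}={value}" for key, value in partition_values.items()]
    return "/".join([table_location.rstrip("/")] + segments)


def parse_partition_path(path):
    """Recover partition columns and values from the key=value segments."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


if __name__ == "__main__":
    # Hypothetical table location, used only for illustration.
    path = partition_path(
        "/warehouse/events", {"year": "2024", "month": "01", "day": "15"}
    )
    print(path)
    print(parse_partition_path(path))
```
</pre>
<p>For the example values, <code>partition_path</code> yields <code>/warehouse/events/year=2024/month=01/day=15</code>, so the partition columns live in the directory names rather than inside the data files.</p><p>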
</p><p>The significant shift so far with <strong>Hive + Hadoop</strong> is:</p><blockquote><p><strong>Unlike traditional monolithic databases, Hadoop and Hive's decoupled approach allows other query and processing engines to process the same data on HDFS using the Hive engine&#8217;s metadata.</strong></p></blockquote><p></p><p>The following example shows typical Hive temporal partitioning based on year, month and day.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QG2m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d595899-cf06-412e-931c-558f3935846e_1097x1148.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QG2m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d595899-cf06-412e-931c-558f3935846e_1097x1148.jpeg 424w, https://substackcdn.com/image/fetch/$s_!QG2m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d595899-cf06-412e-931c-558f3935846e_1097x1148.jpeg 848w, https://substackcdn.com/image/fetch/$s_!QG2m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d595899-cf06-412e-931c-558f3935846e_1097x1148.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!QG2m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d595899-cf06-412e-931c-558f3935846e_1097x1148.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!QG2m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d595899-cf06-412e-931c-558f3935846e_1097x1148.jpeg" width="384" height="401.8523245214221" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d595899-cf06-412e-931c-558f3935846e_1097x1148.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1148,&quot;width&quot;:1097,&quot;resizeWidth&quot;:384,&quot;bytes&quot;:68245,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QG2m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d595899-cf06-412e-931c-558f3935846e_1097x1148.jpeg 424w, https://substackcdn.com/image/fetch/$s_!QG2m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d595899-cf06-412e-931c-558f3935846e_1097x1148.jpeg 848w, https://substackcdn.com/image/fetch/$s_!QG2m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d595899-cf06-412e-931c-558f3935846e_1097x1148.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!QG2m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d595899-cf06-412e-931c-558f3935846e_1097x1148.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Hive-style partitioning scheme</figcaption></figure></div><p></p><p>This leads to another major difference between the new data architecture and traditional DBMS systems: While traditional systems tightly bind data and metadata like table definitions, the new paradigm separates these components. </p><p><em><strong>This decoupling offers great flexibility</strong></em>. 
Data can be ingested into data lakes without accompanying metadata, and multiple processing systems can independently assign their own metadata or table definitions to the same data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lSfM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97014e7-e187-49f7-b296-11b183a8f671_1783x1403.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lSfM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97014e7-e187-49f7-b296-11b183a8f671_1783x1403.png 424w, https://substackcdn.com/image/fetch/$s_!lSfM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97014e7-e187-49f7-b296-11b183a8f671_1783x1403.png 848w, https://substackcdn.com/image/fetch/$s_!lSfM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97014e7-e187-49f7-b296-11b183a8f671_1783x1403.png 1272w, https://substackcdn.com/image/fetch/$s_!lSfM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97014e7-e187-49f7-b296-11b183a8f671_1783x1403.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lSfM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97014e7-e187-49f7-b296-11b183a8f671_1783x1403.png" width="558" height="439.19505494505495" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f97014e7-e187-49f7-b296-11b183a8f671_1783x1403.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1146,&quot;width&quot;:1456,&quot;resizeWidth&quot;:558,&quot;bytes&quot;:31160,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lSfM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97014e7-e187-49f7-b296-11b183a8f671_1783x1403.png 424w, https://substackcdn.com/image/fetch/$s_!lSfM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97014e7-e187-49f7-b296-11b183a8f671_1783x1403.png 848w, https://substackcdn.com/image/fetch/$s_!lSfM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97014e7-e187-49f7-b296-11b183a8f671_1783x1403.png 1272w, https://substackcdn.com/image/fetch/$s_!lSfM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97014e7-e187-49f7-b296-11b183a8f671_1783x1403.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Hive Metastore</figcaption></figure></div><p></p><p>Moreover, a <strong>centralised schema registry</strong> (such as the Hive Metastore, which has become the de facto standard) allows any processing engine to interact with data in a structured tabular format, using familiar languages such as SQL or Python through computation frameworks such as Spark, Presto and Trino. 
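</p><p>The role of such a registry can be sketched with a toy in-memory dictionary. Everything in the Python example below (the <code>REGISTRY</code> layout, the <code>resolve_partition_dirs</code> helper, the table name and location) is a hypothetical stand-in for illustration and is not the Hive Metastore API; the point is only that an engine can resolve which directories to read from metadata alone.</p>
<pre>
```python
# Toy in-memory stand-in for a schema registry; NOT the Hive Metastore API.
REGISTRY = {
    "events": {
        "location": "/warehouse/events",
        "columns": {"user_id": "bigint", "action": "string"},
        "partition_columns": ["year", "month"],
        "partitions": [
            {"year": "2024", "month": "01"},
            {"year": "2024", "month": "02"},
            {"year": "2025", "month": "01"},
        ],
    }
}


def resolve_partition_dirs(registry, table_name, filters):
    """Return directories of partitions whose values match all filters."""
    table = registry[table_name]
    matching_dirs = []
    for partition in table["partitions"]:
        if all(partition.get(col) == val for col, val in filters.items()):
            segments = [
                f"{col}={partition[col]}" for col in table["partition_columns"]
            ]
            matching_dirs.append("/".join([table["location"]] + segments))
    return matching_dirs


if __name__ == "__main__":
    # Any engine reading the same registry resolves the same directories.
    for directory in resolve_partition_dirs(REGISTRY, "events", {"year": "2024"}):
        print(directory)
```
</pre>
<p>Because the lookup depends only on the shared metadata, two different engines asking for <code>year=2024</code> would arrive at the same two partition directories without coordinating with each other.</p><p>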
</p><p>By accessing table metadata within the registry, query engines can determine file locations on the underlying storage layer, understand partitioning schemes, and execute their own read and write operations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UWK1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c875e2-5562-4f32-8310-bcd0048349eb_653x733.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UWK1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c875e2-5562-4f32-8310-bcd0048349eb_653x733.png 424w, https://substackcdn.com/image/fetch/$s_!UWK1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c875e2-5562-4f32-8310-bcd0048349eb_653x733.png 848w, https://substackcdn.com/image/fetch/$s_!UWK1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c875e2-5562-4f32-8310-bcd0048349eb_653x733.png 1272w, https://substackcdn.com/image/fetch/$s_!UWK1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c875e2-5562-4f32-8310-bcd0048349eb_653x733.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UWK1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c875e2-5562-4f32-8310-bcd0048349eb_653x733.png" width="441" height="495.02756508422664" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9c875e2-5562-4f32-8310-bcd0048349eb_653x733.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:733,&quot;width&quot;:653,&quot;resizeWidth&quot;:441,&quot;bytes&quot;:70242,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UWK1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c875e2-5562-4f32-8310-bcd0048349eb_653x733.png 424w, https://substackcdn.com/image/fetch/$s_!UWK1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c875e2-5562-4f32-8310-bcd0048349eb_653x733.png 848w, https://substackcdn.com/image/fetch/$s_!UWK1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c875e2-5562-4f32-8310-bcd0048349eb_653x733.png 1272w, https://substackcdn.com/image/fetch/$s_!UWK1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c875e2-5562-4f32-8310-bcd0048349eb_653x733.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>I hope you now understand why we refer to this design as <em><strong>open</strong></em> and perhaps begin to appreciate its flexibility and open architecture compared to previous generations of database systems.</p><p></p><h3>Drawbacks and Limitations of the Directory-oriented and Hive Table Formats</h3><p>For nearly a decade, from 2006 to 2016, Hive reigned supreme as the most popular table format on Hadoop platforms. Tech giants like Uber, Facebook, and Netflix heavily relied on Hive to manage their data. </p><p>However, as these companies scaled their data platforms, they encountered significant scalability and data management challenges that Hive couldn't adequately address. </p><p>Let's delve into the shortcomings of the directory-oriented table formats and Hive-style tables that prompted the engineers at these tech companies to seek alternatives. 
</p><p>First, let's look at the challenges and drawbacks of the directory-oriented table format, the foundation upon which Hive was built:</p><ul><li><p><strong>High Dependency on Underlying File System</strong> - This architecture relies heavily on the underlying storage system to provide essential guarantees like atomicity, concurrency control, and conflict resolution. Storage systems that lack these properties necessitate custom workarounds; Amazon S3, for example, has no atomic rename operation.</p></li><li><p><strong>File Listing Performance</strong> - Directory and file listing operations can become performance bottlenecks, particularly when executing large-scale queries. Cloud object stores like S3 impose significant limitations on directory-style listing operations. Each LIST request returns a maximum of 1,000 objects, so listing a large directory requires many sequential requests, each subject to network latency and rate limiting. This significantly impacts performance when dealing with large datasets.</p></li><li><p><strong>Query Planning Overhead</strong> - On distributed file systems like HDFS, query planning can be time-consuming due to the need for exhaustive file and partition listing. This is especially pronounced when dealing with a large number of files and partitions.</p></li></ul><p></p><h4>Drawbacks and challenges of using Hive-style partitioning:</h4><ul><li><p><strong>Over Partitioning</strong> - Tightly coupling physical and logical partitioning can lead to over-partitioning, especially with high-cardinality partition columns like <code>year/month/day</code>. This results in an excessive number of small files, increased metadata overhead, and slower query planning due to the need to scan numerous partitions. 
Over-partitioning is particularly detrimental to MPP engines like Hive, Spark, and Presto, as they struggle with planning queries over, and scanning, a large number of small partitions.</p></li><li><p><strong>Cloud Effect</strong> - Cloud data lakes exacerbate over-partitioning issues due to API call limitations. Jobs scanning many partitions and files often encounter throttling, leading to severe performance degradation.</p></li><li><p><strong>Too Many Small Files</strong> - Incorrect partitioning schemes can create numerous small files, which have a <a href="https://blog.cloudera.com/small-files-big-foils-addressing-the-associated-metadata-and-application-challenges/">negative impact on different layers</a> and slow down queries and job planning. Fixing the problem by re-partitioning requires rewriting the entire dataset, a costly and time-consuming process.</p></li><li><p><strong>Poor Performance</strong> - Queries on Hive-style directory-based partitions can be slow when they do not specify the partition key for data skipping, especially with deep partition hierarchies. Accidental full table scans become common, leading to inefficient and lengthy query execution.</p></li><li><p><strong>Accidental Costly Queries</strong> - Accidental full table scans can result in launching large queries and jobs. During my years of managing a Hadoop platform, I had to explain many times to end-users why their simple Hive query was taking a long time to run because it was scanning a large number of partitions during the query planning phase.</p></li></ul><p>Imagine a Hive table partitioned by 20 provinces, followed by <code>year=/month=/day=/hour=</code> partitions. Every day adds 480 new leaf partitions (20 provinces times 24 hours), or roughly 175,000 per year. 
Such a table would accumulate over 1 million partitions in 6 years.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yoEc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0317a170-93d1-40e7-bf74-7bf96a3ed427_2304x788.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yoEc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0317a170-93d1-40e7-bf74-7bf96a3ed427_2304x788.png 424w, https://substackcdn.com/image/fetch/$s_!yoEc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0317a170-93d1-40e7-bf74-7bf96a3ed427_2304x788.png 848w, https://substackcdn.com/image/fetch/$s_!yoEc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0317a170-93d1-40e7-bf74-7bf96a3ed427_2304x788.png 1272w, https://substackcdn.com/image/fetch/$s_!yoEc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0317a170-93d1-40e7-bf74-7bf96a3ed427_2304x788.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yoEc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0317a170-93d1-40e7-bf74-7bf96a3ed427_2304x788.png" width="1456" height="498" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0317a170-93d1-40e7-bf74-7bf96a3ed427_2304x788.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84473,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yoEc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0317a170-93d1-40e7-bf74-7bf96a3ed427_2304x788.png 424w, https://substackcdn.com/image/fetch/$s_!yoEc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0317a170-93d1-40e7-bf74-7bf96a3ed427_2304x788.png 848w, https://substackcdn.com/image/fetch/$s_!yoEc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0317a170-93d1-40e7-bf74-7bf96a3ed427_2304x788.png 1272w, https://substackcdn.com/image/fetch/$s_!yoEc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0317a170-93d1-40e7-bf74-7bf96a3ed427_2304x788.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Over-partitioning issue on Hive</figcaption></figure></div><p></p><h4>Drawbacks of using an External Metastore</h4><p>In addition to the above drawbacks, Hive-style tables that use an external Metastore add more challenges to the mix:</p><ul><li><p><strong>Performance Bottleneck</strong> - Both Hive and Impala rely on an external metadata store (typically a relational database like MySQL or PostgreSQL), which can become a performance bottleneck due to frequent communication for table operations.</p></li><li><p><strong>Metadata Performance Scalability</strong> - As data volumes and partition counts grow, the Metastore becomes increasingly burdened, leading to slow query planning, increased load, and potential out-of-memory errors. These issues have been extensively documented and addressed by the community. 
Many companies such as <a href="https://medium.com/airbnb-engineering/upgrading-data-warehouse-infrastructure-at-airbnb-a4e18f09b6d5">Airbnb have experienced</a> Metastore performance challenges before upgrading their platform.</p></li><li><p><strong>Single Point of Failure</strong> - The Metastore represents a single point of failure. Crashes or unavailability can cause widespread query failures. Implementing high availability is crucial to mitigate downtime.</p></li><li><p><strong>Network Latency</strong> - Network latency between the query engine and the external Metastore, and between the Metastore and its underlying relational database, can impact overall performance.</p></li><li><p><strong>Inefficient Statistics Management</strong> - Hive's reliance on partition-level column statistics, stored in the Metastore, can hinder performance over time. Wide tables with numerous columns and partitions accumulate vast amounts of statistical data, slowing down query planning and impacting DDL commands like table renaming.</p></li></ul><p></p><h3>A First-hand Experience</h3><p>I have personally faced many of the above challenges while working with Hive in production for many years. In a recent project, our development team had to rename some large and wide managed Hive tables with about 10k partitions each, and the rename would simply hang and not complete even after many hours. </p><p>After investigating, I found that about 300k statistics records were stored for each table, and Hive was trying to gather and update all of them as part of the rename. Even after rebuilding the index on the stats table in the PostgreSQL database, the issue was not fully resolved.</p><div><hr></div><p>I believe I&#8217;ve made a pretty strong case against the Hive table format and its underlying directory-oriented architecture. 
Apache Hive has served the big data community well for nearly a decade, but it's time to improve and develop something more efficient and scalable.</p><p></p><h3>Transactional Guarantees on Data Lakes</h3><p>Before presenting the next evolution of table formats, let's also examine some common challenges associated with implementing database management systems on a data lake backed by distributed file systems such as HDFS or object stores like S3. </p><p>These challenges are not specific to Hive or any other data management tool but are generally related to the ACID and transactional properties of a traditional DBMS.</p><ul><li><p><strong>Lack of Atomicity</strong> - Writing multiple objects atomically as a single transaction is not natively supported, which puts data integrity at risk.</p></li><li><p><strong>Concurrency Control Challenges</strong> - Concurrent modifications to files within the same directory or partition can lead to data loss or corruption due to the absence of transaction coordination.</p></li><li><p><strong>Absence of Transactional Features</strong> - Data lakes built on HDFS or object stores lack built-in transaction isolation and concurrency control, requiring organisations to relax consistency requirements or implement custom solutions. Without transaction isolation, readers can encounter incomplete or corrupt data due to concurrent writes. 
</p></li><li><p>For read and write isolation, downstream consumers would have to implement custom mechanisms to ensure data consistency, such as waiting for upstream batch data processing workloads to complete before initiating their own jobs.</p></li><li><p><strong>Lack of Support for Record-Level Mutations</strong> - The immutable nature of the underlying storage systems prevents direct updates or deletes at the record level in data files.</p></li><li><p><strong>Object Store Challenges</strong> - Object stores like S3 historically lacked strong read-after-write consistency, prompting some organisations to use staging clusters (e.g., HDFS) as an intermediate step before final data placement. Additionally, the absence of atomic rename operations has posed challenges for distributed processing engines like Spark and Hive, which rely on temporary directories for data staging before finalising output.</p></li></ul><p></p><h3>Hive Transactional Tables</h3><p>The Hive ACID feature was the first attempt to introduce structured storage guarantees, particularly ACID transactions (Atomic, Consistent, Isolated, Durable), to the realm of immutable data lakes. </p><p>Introduced in Hive 0.13 (2014) and substantially reworked in Hive version 3 (2018), this feature marked a significant leap forward by providing stronger consistency guarantees like cross-partition atomicity and isolation. Additionally, it offered improved management of mutable data on data lakes through upsert functionality.</p><p>But the addition of ACID to Hive didn&#8217;t solve the fundamental issues because:</p><blockquote><p><strong>Hive ACID tables remained rooted in the directory-oriented approach, relying on a separate metadata store for managing table-level information within the underlying data lake storage layer. </strong></p></blockquote><p>Several attempts were made to integrate Hive ACID into the broader data ecosystem. 
<strong>Hortonworks</strong> developed the <strong>Hive Warehouse Connector</strong> to enable Spark to read Hive transactional tables, initially relying on the <strong>Hive LLAP</strong> component. </p><p><strong>Cloudera</strong> later introduced the <strong>Spark Direct Reader</strong> mode in 2020, allowing direct file system access without the Hive LLAP dependency.</p><p>Despite these efforts, I would say Hive ACID never captured the imagination of the community and failed to gain widespread adoption due to its underlying design limitations. Support for reading and writing Hive ACID tables remained inconsistent across the ecosystem, with many prominent tools like Presto offering limited or no support.</p><p></p><p>That&#8217;s the end of Part I. <strong><a href="https://open.substack.com/pub/practicaldataengineering/p/the-history-and-evolution-of-open-14d?r=23jwn&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part II</a></strong> discusses the next generation of open table formats.</p><p></p>]]></content:encoded></item><item><title><![CDATA[Open Source Data Engineering Landscape 2024]]></title><description><![CDATA[Exploration of the open source software in data engineering ecosystem]]></description><link>https://www.pracdata.io/p/open-source-data-engineering-landscape</link><guid isPermaLink="false">https://www.pracdata.io/p/open-source-data-engineering-landscape</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 28 Jan 2024 08:28:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!N5ze!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb0a078-3792-4dbc-8547-86f711e27070_5022x4187.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>While the widespread hype surrounding Generative AI and ChatGPT took the tech world by storm, 2023 witnessed yet another exciting and vibrant year in the data engineering 
landscape, which has steadily grown more diverse and sophisticated, with continuous innovation and evolution across all tiers of the analytical hierarchy.</p><p>With the continued proliferation of open source tools, frameworks, and solutions, the options available to data engineers have multiplied! In such a rapidly changing landscape, the importance of staying abreast of the latest technologies and trends cannot be overstated. The ability to choose the right tool for the right job is a crucial skill, ensuring efficiency and relevance in the face of evolving data engineering challenges.</p><p>Having closely followed data engineering trends in my role as a senior data engineer and consultant, I'd like to present the open source data engineering landscape at the beginning of 2024. This includes identifying key active projects and prominent tools, empowering readers to make informed decisions when navigating this dynamic technological landscape.</p><p></p><h1>Why Present Another Landscape?</h1><p>Why make the effort to present yet another data landscape? There are similar periodic reports such as the famous <strong><a href="https://mattturck.com/mad2023/">MAD Landscape</a></strong>, and other reports such as <strong><a href="https://future.com/data50/">Data50</a></strong>, <strong><a href="https://lakefs.io/blog/the-state-of-data-engineering-2023/">State of Data Engineering</a></strong> and <strong><a href="https://cloudinfrastructure.substack.com/p/introducing-the-redpoint-open-source">Redpoint Open Source Top 25</a></strong>. However, the landscape I'm presenting focuses solely on open source tools, mainly those applicable to data platforms and the data engineering lifecycle. The MAD Landscape provides a very comprehensive view of all tools and services for Machine Learning, AI and Data, including both commercial and open source, while the landscape presented here provides a more comprehensive view of active open source projects in the <em><strong>Data</strong></em> part of MAD. 
Other reports such as Redpoint Open Source Top 25 and Data50 focus more on SaaS providers and startups, whereas this report focuses on the open source projects themselves rather than the SaaS offerings.</p><p>Annual reports and surveys such as <strong><a href="https://octoverse.github.com/">GitHub's state of open source</a></strong>, the <strong><a href="https://survey.stackoverflow.co/2023/">Stack Overflow annual survey</a></strong> and <strong><a href="https://ossinsight.io/">OSS Insight reports</a></strong> are also great sources for gaining insight into what's being used or trending in the community, but they only cover limited sections (such as databases and languages) of the overall data landscape.</p><p>Therefore, due to my interest in open source data stacks, I've compiled the open source tools and services in the data engineering ecosystem.</p><p>So, without further ado, here is the 2024 Open Source Data Engineering Ecosystem:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N5ze!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb0a078-3792-4dbc-8547-86f711e27070_5022x4187.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N5ze!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb0a078-3792-4dbc-8547-86f711e27070_5022x4187.png 424w, https://substackcdn.com/image/fetch/$s_!N5ze!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb0a078-3792-4dbc-8547-86f711e27070_5022x4187.png 848w, 
https://substackcdn.com/image/fetch/$s_!N5ze!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb0a078-3792-4dbc-8547-86f711e27070_5022x4187.png 1272w, https://substackcdn.com/image/fetch/$s_!N5ze!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb0a078-3792-4dbc-8547-86f711e27070_5022x4187.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N5ze!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb0a078-3792-4dbc-8547-86f711e27070_5022x4187.png" width="1456" height="1214" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fb0a078-3792-4dbc-8547-86f711e27070_5022x4187.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1214,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3252485,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N5ze!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb0a078-3792-4dbc-8547-86f711e27070_5022x4187.png 424w, https://substackcdn.com/image/fetch/$s_!N5ze!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb0a078-3792-4dbc-8547-86f711e27070_5022x4187.png 848w, 
https://substackcdn.com/image/fetch/$s_!N5ze!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb0a078-3792-4dbc-8547-86f711e27070_5022x4187.png 1272w, https://substackcdn.com/image/fetch/$s_!N5ze!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb0a078-3792-4dbc-8547-86f711e27070_5022x4187.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">OSS Data Engineering Landscape 2024</figcaption></figure></div><p></p><h2>Tool Selection Criteria</h2><p>The 
available open source projects in each category are obviously numerous, making it impractical to include every tool and service in the picture. Therefore, I have applied the following criteria when selecting the tools for each category:</p><ul><li><p>Any retired, archived or abandoned projects are excluded. Some notable retired projects are <strong>Apache Sqoop</strong>, <strong>Scribe</strong> and <strong>Apache Apex</strong>, which might still be used in some production environments.</p></li><li><p>Projects which have been completely inactive on GitHub over the past year and are hardly mentioned in the community are excluded. Notable examples are the <strong>Apache Pig</strong> and <strong>Apache Oozie</strong> projects.</p></li><li><p>Projects which are still quite new and have not gained much traction in terms of GitHub stars and forks, as well as blog posts, showcases and mentions in online communities, are excluded. However, some promising projects such as <strong>OneTable</strong>, which have gained notable traction and are built on the foundation of existing technologies, are mentioned.</p></li><li><p>Data Science, ML and AI tools are excluded, except for ML platform and infrastructure tools, as I'm only focusing on what's related to the data engineering discipline. </p></li><li><p>Different types of storage systems such as relational OLTP and embedded database systems are listed. This is because the data engineering discipline involves dealing with many different internal and external storage systems used in applications and operational systems (BSS), even if they are not part of the analytics stack. </p></li><li><p>The category names are chosen to be as generic as possible, based on where the tool fits in the data stack. 
For storage systems, the main database model and database workload (OLTP, OLAP) are used for grouping and labeling the systems, although some categories go by other names in the market. For instance, "<em>Distributed SQL DBMS</em>" are also referred to as <em><strong>HTAP</strong></em> or <em><strong>scalable SQL databases</strong></em>.</p></li><li><p>Some tools could belong to more than one category. For example, <strong>VoltDB</strong> is both an in-memory database and a distributed SQL DBMS. I have tried to place such tools in the category by which they are mostly recognised in the market.</p></li><li><p>For certain database systems, there may be a blurry line regarding the category they actually belong to. For example, <strong>ByConity</strong> claims to be a data warehousing solution, but is built on top of <strong>ClickHouse</strong>, which is recognised as a real-time OLAP engine. Therefore, it is still unclear whether it is a real-time OLAP system (able to support sub-second queries) or not.</p></li><li><p>Not all the listed projects are fully <strong><a href="https://medium.com/@mbhide/open-vs-portable-c1beda131ab6">Portable</a></strong> open source tools. Some of the projects are <strong><a href="https://opensource.com/article/21/11/open-core-vs-open-source">Open Core</a></strong> rather than fully open source. In <em>open core</em> models, not all components of the full system, as offered by the main SaaS provider, are made open source. Therefore, when deciding to adopt an open-source tool, it is important to consider how portable and truly open source the project is.</p></li></ul><p></p><h2>Overview of Tool Categories</h2><p>In the following section each category is briefly discussed.</p><p></p><h3>1. Storage Systems</h3><p>Storage systems are the largest category in the presented landscape, primarily due to the recent surge of specialized database systems. The two latest trending categories are <strong>vector</strong> and <strong>streaming</strong> databases. 
<strong>Materialize</strong> and <strong>RisingWave</strong> are examples of open-source streaming database systems. Vector databases are also experiencing rapid growth in the storage systems field. I have placed vector storage systems in the ML Platform section since they are primarily used in ML and AI stacks. Distributed file systems and object stores are also placed in their own category, Data Lake Platform.</p><p>As mentioned in the selection criteria section, storage systems are grouped and labeled based on the main database model and workload. At the highest level, storage systems can be classified into three main classes: <strong>OLTP</strong>, <strong>OLAP</strong>, and <strong>HTAP</strong>. They can be further categorized based on SQL vs NoSQL for OLTP engines, and Offline (non-real-time) vs Real-time (sub-second results) for OLAP engines, as shown in the following figure.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v5zc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ecd1ba-bd0b-4724-b9e0-f0e4eae11ea8_3840x2147.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v5zc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ecd1ba-bd0b-4724-b9e0-f0e4eae11ea8_3840x2147.png 424w, https://substackcdn.com/image/fetch/$s_!v5zc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ecd1ba-bd0b-4724-b9e0-f0e4eae11ea8_3840x2147.png 848w, 
https://substackcdn.com/image/fetch/$s_!v5zc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ecd1ba-bd0b-4724-b9e0-f0e4eae11ea8_3840x2147.png 1272w, https://substackcdn.com/image/fetch/$s_!v5zc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ecd1ba-bd0b-4724-b9e0-f0e4eae11ea8_3840x2147.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v5zc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ecd1ba-bd0b-4724-b9e0-f0e4eae11ea8_3840x2147.png" width="1456" height="814" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25ecd1ba-bd0b-4724-b9e0-f0e4eae11ea8_3840x2147.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94177,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v5zc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ecd1ba-bd0b-4724-b9e0-f0e4eae11ea8_3840x2147.png 424w, https://substackcdn.com/image/fetch/$s_!v5zc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ecd1ba-bd0b-4724-b9e0-f0e4eae11ea8_3840x2147.png 848w, 
https://substackcdn.com/image/fetch/$s_!v5zc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ecd1ba-bd0b-4724-b9e0-f0e4eae11ea8_3840x2147.png 1272w, https://substackcdn.com/image/fetch/$s_!v5zc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ecd1ba-bd0b-4724-b9e0-f0e4eae11ea8_3840x2147.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3>2. 
Data Lake Platform</h3><p>The data lake platform has continued to mature over the past year, and Gartner has placed data lakes on the Slope of Enlightenment in the 2023 edition of its <a href="https://www.gartner.com/en/documents/4573399">Hype Cycle for Data Management</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W2Pw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63656212-fc30-4e0c-bcce-9bdd02dd6ee1_1412x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W2Pw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63656212-fc30-4e0c-bcce-9bdd02dd6ee1_1412x918.png 424w, https://substackcdn.com/image/fetch/$s_!W2Pw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63656212-fc30-4e0c-bcce-9bdd02dd6ee1_1412x918.png 848w, https://substackcdn.com/image/fetch/$s_!W2Pw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63656212-fc30-4e0c-bcce-9bdd02dd6ee1_1412x918.png 1272w, https://substackcdn.com/image/fetch/$s_!W2Pw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63656212-fc30-4e0c-bcce-9bdd02dd6ee1_1412x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W2Pw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63656212-fc30-4e0c-bcce-9bdd02dd6ee1_1412x918.png" width="1412" height="918" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63656212-fc30-4e0c-bcce-9bdd02dd6ee1_1412x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:918,&quot;width&quot;:1412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:285246,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W2Pw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63656212-fc30-4e0c-bcce-9bdd02dd6ee1_1412x918.png 424w, https://substackcdn.com/image/fetch/$s_!W2Pw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63656212-fc30-4e0c-bcce-9bdd02dd6ee1_1412x918.png 848w, https://substackcdn.com/image/fetch/$s_!W2Pw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63656212-fc30-4e0c-bcce-9bdd02dd6ee1_1412x918.png 1272w, https://substackcdn.com/image/fetch/$s_!W2Pw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63656212-fc30-4e0c-bcce-9bdd02dd6ee1_1412x918.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Source: Gartner</em></p><p>For the storage layer, <strong>distributed file systems</strong> and <strong>object stores</strong> are still the main technologies serving as the bedrock for both on-premises and cloud-based data lake implementations. While <strong>HDFS</strong> is still the primary technology used for on-premises Hadoop clusters, the <strong>Apache Ozone</strong> distributed object store is catching up as an alternative on-premises data lake storage technology. <strong>Cloudera</strong>, the main commercial Hadoop provider, now offers Ozone as part of its CDP Private Cloud offering.</p><p>The choice of data serialization format impacts storage efficiency and processing performance. <strong>Apache ORC</strong> remains the preferred choice for columnar storage within Hadoop ecosystems, while <strong>Apache Parquet</strong> has emerged as the de facto standard for data serialization in modern data lakes. 
Its popularity stems from its compact size, efficient compression, and wide compatibility with various processing engines.</p><p>Another key trend in 2023 was the <strong>decoupling of storage and compute layers</strong>. Many storage systems now offer integration with cloud-based object storage solutions like <strong>S3</strong>, leveraging their inherent efficiency and elasticity. This approach allows data processing resources to scale independently from storage, leading to cost savings and enhanced scalability. CockroachDB's support for S3 as a storage backend and <strong>Confluent</strong>'s offering of long-term <strong>Kafka</strong> topic data retention on <strong>S3</strong> further exemplify this trend, highlighting the growing use of data lakes as cost-effective, long-term storage solutions.</p><p>One of the hottest developments in 2023 was the rise of <strong>open table formats</strong>. These frameworks essentially act as a table abstraction and virtual data management layer sitting atop your data lake storage and data layer, as depicted in the following diagram.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9YvO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1a2f2c-c2e3-4383-873f-82dc2527003e_1133x623.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9YvO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1a2f2c-c2e3-4383-873f-82dc2527003e_1133x623.png 424w, https://substackcdn.com/image/fetch/$s_!9YvO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1a2f2c-c2e3-4383-873f-82dc2527003e_1133x623.png 848w, 
https://substackcdn.com/image/fetch/$s_!9YvO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1a2f2c-c2e3-4383-873f-82dc2527003e_1133x623.png 1272w, https://substackcdn.com/image/fetch/$s_!9YvO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1a2f2c-c2e3-4383-873f-82dc2527003e_1133x623.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9YvO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1a2f2c-c2e3-4383-873f-82dc2527003e_1133x623.png" width="1133" height="623" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b1a2f2c-c2e3-4383-873f-82dc2527003e_1133x623.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:623,&quot;width&quot;:1133,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14211,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9YvO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1a2f2c-c2e3-4383-873f-82dc2527003e_1133x623.png 424w, https://substackcdn.com/image/fetch/$s_!9YvO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1a2f2c-c2e3-4383-873f-82dc2527003e_1133x623.png 848w, 
https://substackcdn.com/image/fetch/$s_!9YvO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1a2f2c-c2e3-4383-873f-82dc2527003e_1133x623.png 1272w, https://substackcdn.com/image/fetch/$s_!9YvO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1a2f2c-c2e3-4383-873f-82dc2527003e_1133x623.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The open table format space is currently dominated by a fierce battle for supremacy between the following three major 
contenders:</p><ul><li><p><strong>Apache Hudi</strong>: Initially developed and open-sourced by <strong>Uber</strong>, with the main design goal of supporting near-real-time data updates and ACID transactions.</p></li><li><p><strong>Apache Iceberg</strong>: Born from <strong>Netflix</strong>'s engineering team.</p></li><li><p><strong>Delta Lake</strong>: Created and open-sourced by <strong>Databricks</strong>, with seamless integration with the Databricks platform.</p></li></ul><p>The funding received in 2023 by the leading SaaS providers in this space &#8211; <strong>Databricks</strong>, <strong>Tabular</strong>, and <strong>OneHouse</strong> &#8211; emphasises market interest and their potential to further advance data management on data lakes.</p><p>Moreover, a new trend is now unfolding with the emergence of <strong>unified data lakehouse layers</strong>. <strong><a href="https://www.onehouse.ai/blog/onetable-is-now-open-source">OneTable</a></strong> (recently open-sourced by OneHouse) and <strong><a href="https://www.databricks.com/blog/announcing-delta-lake-30-new-universal-format-and-liquid-clustering">UniForm</a></strong> (currently a non-open-source offering from Databricks) are the first two such projects, both announced last year. These tools go beyond individual table formats, offering the ability to work with all three major contenders under a single umbrella. This empowers users to embrace a universal format while exposing data to processing engines in their preferred formats, leading to increased flexibility and agility.</p><p></p><h3>3. 
Data Integration</h3><p>The data integration landscape in 2023 witnessed not only continued dominance from established players like <strong>Apache NiFi</strong>, <strong>Airbyte</strong>, and <strong>Meltano</strong>, but also the emergence of promising tools like <strong>Apache InLong</strong> and <strong>Apache SeaTunnel</strong>, which offer compelling alternatives with their unique strengths.</p><p>Meanwhile, <strong>Streaming CDC</strong> (Change Data Capture) has further matured, fueled by active development in the Kafka ecosystem. <strong>Kafka Connect</strong> and <strong>Debezium</strong> plugins have become go-to choices for near real-time data capture from database systems, while <strong>Flink CDC Connectors</strong> are gaining traction for deployments using Flink as the primary stream processing engine.</p><p>Beyond traditional databases, tools like <strong>CloudQuery</strong> and <strong>Streampipe</strong> are simplifying data integration from APIs, providing convenient solutions for ingesting data from diverse sources, which reflects the growing importance of flexible integration with cloud-based services.</p><p>In the realm of event and messaging middleware, <strong>Apache Kafka</strong> maintains its strong position, though challengers like <strong>Redpanda</strong> are closing the gap. Redpanda's $100 million Series C funding in 2023 shows the growing interest in alternative message brokers offering low latency and high throughput.</p><p></p><h3>4. Data Processing &amp; Computation</h3><p>The world of stream processing continued to heat up in 2023! <strong>Apache Spark</strong> and <strong>Apache Flink</strong> remain the reigning champions; however, Apache Flink made some serious headlines in 2023. 
Cloud giants like <strong>AWS</strong> and <strong>Alibaba</strong> jumping on board with <strong>Flink-as-a-service</strong> offerings, and <strong>Confluent</strong>'s acquisition of Immerok for its own fully managed Flink-as-a-service offering, show the momentum behind this powerful engine.</p><p>In the Python ecosystem, data processing libraries such as <strong>Vaex</strong>, <strong>Dask</strong>, <strong>Polars</strong>, and <strong>Ray</strong> are available for exploiting multi-core processors. These parallel execution libraries further unlock possibilities for analysing massive datasets within the familiar Python environment.</p><p></p><h3>5. Workflow Management &amp; DataOps</h3><p>The workflow orchestration landscape is arguably the most packed category in the presented data ecosystem, filled with established heavyweights and exciting newcomers. </p><p>Veteran tools such as <strong>Apache Airflow</strong> and <strong>Dagster</strong> are still going strong and remain widely used engines amid the recent hot debates in the community on <a href="https://news.ycombinator.com/item?id=30351461">unbundling</a>, <a href="https://dagster.io/blog/rebundling-the-data-platform">rebundling</a> and <a href="https://www.dataengineeringweekly.com/p/bundling-vs-unbundling-the-tale-of">bundling vs unbundling</a> of workflow orchestration engines. On the other hand, over the past two years, GitHub has witnessed the rise of several compelling contenders that have captured significant traction. <strong>Kestra</strong>, <strong>Temporal</strong>, <strong>Mage</strong>, and <strong>Windmill</strong> are all worth watching, each offering unique strengths. Whether focusing on serverless orchestration like Temporal, or distributed task execution like Mage, these newcomers can cater to the evolving needs of modern data pipelines.</p><p></p><h3>6. 
Data Infrastructure &amp; Monitoring</h3><p>The recent <a href="https://grafana.com/observability-survey-2023/">Grafana Labs survey</a> confirms that <strong>Grafana</strong>, <strong>Prometheus</strong> and the <strong>ELK</strong> stack continue to dominate the observability and monitoring landscape. <strong>Grafana Labs</strong> itself has been quite active, introducing new open-source tools like <strong>Loki</strong> (for log aggregation) and <strong>Mimir</strong> (for long-term Prometheus storage) to further strengthen its platform.</p><p>One area where open-source tools seem less prevalent is <strong>cluster management and monitoring</strong>. This likely stems from the cloud migration trend, which reduces the need for managing large on-premises data platforms. While the <strong>Apache Ambari</strong> project, once popular for managing Hadoop clusters, was practically abandoned after the <strong>Hortonworks-Cloudera merger</strong> in 2019, a <a href="https://www.openlogic.com/blog/about-apache-ambari">recent revival</a> sparks some hope for its future. However, its long-term fate remains uncertain.</p><p>As for resource scheduling and workload deployment, <strong>Kubernetes</strong> seems to be the preferred resource scheduler, especially on <a href="https://www.pepperdata.com/2232023-pepperdata-survey-uncovers-state-kubernetes-2023-cloud-cost-remediation">cloud-based platforms</a>.</p><p></p><h3>7. ML Platform</h3><p>The machine learning platform has been one of the most active categories, with an unprecedented rise of interest in <strong>vector databases</strong>, specialised systems optimised for the storage and retrieval of high-dimensional data. 
As highlighted by <a href="https://www.datanami.com/2023/11/14/retools-state-of-ai-report-highlights-the-rise-of-vector-databases/">DB-Engines' 2023 report</a>, vector databases emerged as the most popular database category in the past year.</p><p><strong>MLOps</strong> tools also play an increasingly vital role in scaling ML projects efficiently, ensuring smooth operations and ML application lifecycle management. As the complexity and scale of ML deployments continue to grow, MLOps tools have become indispensable for streamlining development, deployment, and monitoring of ML models.</p><p></p><h3>8. Metadata Management</h3><p>In recent years, metadata management has taken center stage, propelled by the growing need to govern data and improve its management and accessibility. However, the lack of comprehensive metadata management platforms prompted tech giants like Netflix, Lyft, Airbnb, Twitter, LinkedIn and PayPal to build their own solutions.</p><p>These efforts yielded some remarkable contributions to the open-source community. Tools like <strong>Amundsen</strong> (from Lyft), <strong>DataHub</strong> (from LinkedIn), and <strong>Marquez</strong> (from WeWork) are homegrown solutions that have been open-sourced and are under active development.</p><p>When it comes to schema management, the landscape remains somewhat stagnant. <strong>Hive Metastore</strong> continues to be the go-to solution for many, as there is currently no alternative open-source solution to replace it.</p><p></p><h3>9. Analytics &amp; Visualisation</h3><p>In the Business Intelligence (BI) and visualisation domain, <strong>Apache Superset</strong> stands out as the most active and popular open-source alternative to licensed SaaS BI solutions. 
</p><p>As for distributed and Massively Parallel Processing (MPP) engines, some experts argue that <strong><a href="https://motherduck.com/blog/big-data-is-dead/">big data is dead</a></strong> and that the majority of companies don't require large-scale distributed processing, opting instead for single, powerful servers to handle their data volumes. </p><p>Despite this claim, distributed MPP engines like <strong>Apache Hive</strong>, <strong>Impala</strong>, <strong>Presto</strong> and <strong>Trino</strong> remain prevalent within large data platforms, especially for petabyte-scale data.</p><p>Beyond traditional MPP engines, <strong>uniform execution engines</strong> are another trend gaining traction. Engines such as <strong>Apache Linkis</strong>, <strong>Alluxio</strong> and <strong>Cube</strong> provide a query and computation middleware between upper-layer applications and the underlying engines.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YNND!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d2c8743-e1e3-47a6-af71-942f5ff730d6_4056x1598.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YNND!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d2c8743-e1e3-47a6-af71-942f5ff730d6_4056x1598.png 424w, https://substackcdn.com/image/fetch/$s_!YNND!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d2c8743-e1e3-47a6-af71-942f5ff730d6_4056x1598.png 848w, 
https://substackcdn.com/image/fetch/$s_!YNND!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d2c8743-e1e3-47a6-af71-942f5ff730d6_4056x1598.png 1272w, https://substackcdn.com/image/fetch/$s_!YNND!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d2c8743-e1e3-47a6-af71-942f5ff730d6_4056x1598.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YNND!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d2c8743-e1e3-47a6-af71-942f5ff730d6_4056x1598.png" width="1456" height="574" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d2c8743-e1e3-47a6-af71-942f5ff730d6_4056x1598.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;linkis-intro-03&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="linkis-intro-03" title="linkis-intro-03" srcset="https://substackcdn.com/image/fetch/$s_!YNND!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d2c8743-e1e3-47a6-af71-942f5ff730d6_4056x1598.png 424w, https://substackcdn.com/image/fetch/$s_!YNND!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d2c8743-e1e3-47a6-af71-942f5ff730d6_4056x1598.png 848w, 
https://substackcdn.com/image/fetch/$s_!YNND!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d2c8743-e1e3-47a6-af71-942f5ff730d6_4056x1598.png 1272w, https://substackcdn.com/image/fetch/$s_!YNND!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d2c8743-e1e3-47a6-af71-942f5ff730d6_4056x1598.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Source: <a href="https://github.com/apache/linkis">GitHub</a></em></p><h1>Conclusion</h1><p>This exploration of the open-source 
data engineering landscape is a glimpse into the dynamic and vibrant world of data platforms. While prominent tools and technologies were covered across various categories, the ecosystem continues to evolve rapidly, with new solutions emerging continuously.</p><p>Remember, this is not an exhaustive list, and the "best" tools are ultimately determined by your specific needs and use cases. Feel free to share any notable tools I've missed that you think should&#8217;ve been included. </p><p></p><h3>Update:</h3><p>I&#8217;ve created a live <a href="https://github.com/pracdata/awesome-open-source-data-engineering">repository on GitHub</a> with the full list and links to all listed projects. Please feel free to track and collaborate.</p>]]></content:encoded></item><item><title><![CDATA[How to build a dual Incremental + snapshot data ingestion pipeline]]></title><description><![CDATA[A useful batch data ingestion pattern for maximum data correctness and reliability as well as providing low latency access.]]></description><link>https://www.pracdata.io/p/how-to-build-a-dual-incremental-snapshot</link><guid isPermaLink="false">https://www.pracdata.io/p/how-to-build-a-dual-incremental-snapshot</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 01 Oct 2023 13:21:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a62c27-e364-467b-b4c6-8b8ddd48f06e_1553x327.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my <a href="https://open.substack.com/pub/practicaldataengineering/p/common-techniques-for-extracting?r=23jwn&amp;utm_campaign=post&amp;utm_medium=web">previous blog post</a>, I covered different relational database data extraction techniques when using batch or micro-batch data collection frameworks.</p><p>Now let&#8217;s discuss a more practical example of developing a useful data ingestion pattern with dual data pipelines: one pipeline for incrementally loading recent data with a lower latency, such as near real-time (NRT) or hourly, and another to fully reload the dataset using the snapshot mechanism with a higher latency, such as nightly.</p><p>As mentioned briefly in the <a 
href="https://open.substack.com/pub/practicaldataengineering/p/common-techniques-for-extracting?r=23jwn&amp;utm_campaign=post&amp;utm_medium=web">previous blog post</a>, when working in a data environment that is not fully reliable, there may be situations where you cannot rely on the accuracy of source database metadata columns, such as the record updated timestamp, to ensure that no changes in the source data are ever missed. In some cases, records are manually fixed, changed, or updated by system administrators and DBAs in the backend system or database, but these changes may not be reflected in the next data ingestion run if the control or metadata columns are not updated accordingly. Additionally, corrupt or incorrect records may be deleted without leaving any trace. Therefore, when you have strict SLAs requiring the ingested data to always match the source data 100%, you are left with only a few choices, such as fully reloading the data in the target storage system.</p><p>To get the best of both worlds, that is, to have the most recent data with low latency and to ensure that the data is eventually always correct, one technique is to build a dual data pipeline, with one pipeline for each objective. 
This is similar to the <a href="https://www.databricks.com/glossary/lambda-architecture">Lambda architecture</a>, which has two distinct pipelines and in which the reliable batch pipeline eventually catches up with the real-time streaming data pipeline.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kvcF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a62c27-e364-467b-b4c6-8b8ddd48f06e_1553x327.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kvcF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a62c27-e364-467b-b4c6-8b8ddd48f06e_1553x327.png 424w, https://substackcdn.com/image/fetch/$s_!kvcF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a62c27-e364-467b-b4c6-8b8ddd48f06e_1553x327.png 848w, https://substackcdn.com/image/fetch/$s_!kvcF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a62c27-e364-467b-b4c6-8b8ddd48f06e_1553x327.png 1272w, https://substackcdn.com/image/fetch/$s_!kvcF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a62c27-e364-467b-b4c6-8b8ddd48f06e_1553x327.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kvcF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a62c27-e364-467b-b4c6-8b8ddd48f06e_1553x327.png" width="1456" height="307" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5a62c27-e364-467b-b4c6-8b8ddd48f06e_1553x327.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:307,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kvcF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a62c27-e364-467b-b4c6-8b8ddd48f06e_1553x327.png 424w, https://substackcdn.com/image/fetch/$s_!kvcF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a62c27-e364-467b-b4c6-8b8ddd48f06e_1553x327.png 848w, https://substackcdn.com/image/fetch/$s_!kvcF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a62c27-e364-467b-b4c6-8b8ddd48f06e_1553x327.png 1272w, https://substackcdn.com/image/fetch/$s_!kvcF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5a62c27-e364-467b-b4c6-8b8ddd48f06e_1553x327.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>In terms of scheduling and latency, the following has to be decided for each pipeline:</p><ul><li><p>Frequency and latency of the incremental pipeline, based on business needs. 
This can range from near real-time (NRT), using a streaming CDC-based pipeline, to hourly or even longer latency.</p></li><li><p>Frequency of the snapshot pipeline run, depending on how long the business can tolerate data drift or changes not captured by the incremental pipeline.</p></li></ul><p>In the example that follows, an incremental pipeline is developed to extract recent data from a source system (e.g. a CRM) object such as the <em>product</em> table on an hourly basis, and a daily snapshot pipeline is implemented to fully reload the data on the target data platform on a nightly basis.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D-bt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea324c1-7b60-4e57-8dfc-b2463bde5c8e_2691x1454.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D-bt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea324c1-7b60-4e57-8dfc-b2463bde5c8e_2691x1454.png 424w, https://substackcdn.com/image/fetch/$s_!D-bt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea324c1-7b60-4e57-8dfc-b2463bde5c8e_2691x1454.png 848w, https://substackcdn.com/image/fetch/$s_!D-bt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea324c1-7b60-4e57-8dfc-b2463bde5c8e_2691x1454.png 1272w, https://substackcdn.com/image/fetch/$s_!D-bt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea324c1-7b60-4e57-8dfc-b2463bde5c8e_2691x1454.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!D-bt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea324c1-7b60-4e57-8dfc-b2463bde5c8e_2691x1454.png" width="1456" height="787" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ea324c1-7b60-4e57-8dfc-b2463bde5c8e_2691x1454.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:787,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D-bt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea324c1-7b60-4e57-8dfc-b2463bde5c8e_2691x1454.png 424w, https://substackcdn.com/image/fetch/$s_!D-bt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea324c1-7b60-4e57-8dfc-b2463bde5c8e_2691x1454.png 848w, https://substackcdn.com/image/fetch/$s_!D-bt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea324c1-7b60-4e57-8dfc-b2463bde5c8e_2691x1454.png 1272w, https://substackcdn.com/image/fetch/$s_!D-bt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea324c1-7b60-4e57-8dfc-b2463bde5c8e_2691x1454.png 1456w" sizes="100vw"></picture></div></a></figure></div>
<h2>Pipeline Development and Workflow Orchestration</h2><p>For demonstration purposes, the popular <strong>Apache Airflow</strong> orchestration engine will be used to implement and manage the snapshot and incremental pipelines.</p><p>At the DAG level, we would have two ingestion DAGs: one scheduled to run <em>hourly</em> for the incremental pipeline, and another scheduled to run <em>daily</em> for the snapshot pipeline.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wBPg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc703c2a-50dc-4d7f-bd0a-317722e1431d_1300x356.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wBPg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc703c2a-50dc-4d7f-bd0a-317722e1431d_1300x356.png 424w, https://substackcdn.com/image/fetch/$s_!wBPg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc703c2a-50dc-4d7f-bd0a-317722e1431d_1300x356.png 848w, https://substackcdn.com/image/fetch/$s_!wBPg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc703c2a-50dc-4d7f-bd0a-317722e1431d_1300x356.png 1272w, https://substackcdn.com/image/fetch/$s_!wBPg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc703c2a-50dc-4d7f-bd0a-317722e1431d_1300x356.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wBPg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc703c2a-50dc-4d7f-bd0a-317722e1431d_1300x356.png" width="1300" height="356" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc703c2a-50dc-4d7f-bd0a-317722e1431d_1300x356.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:1300,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65174,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!wBPg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc703c2a-50dc-4d7f-bd0a-317722e1431d_1300x356.png 424w, https://substackcdn.com/image/fetch/$s_!wBPg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc703c2a-50dc-4d7f-bd0a-317722e1431d_1300x356.png 848w, https://substackcdn.com/image/fetch/$s_!wBPg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc703c2a-50dc-4d7f-bd0a-317722e1431d_1300x356.png 1272w, https://substackcdn.com/image/fetch/$s_!wBPg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc703c2a-50dc-4d7f-bd0a-317722e1431d_1300x356.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p></p><h3>Snapshot Pipeline</h3><p>Our daily snapshot workflow for the sample <em>product</em> table would look like the following DAG:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a72H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e9d232c-5c28-40a6-9bb4-8f9c2744f31e_1866x224.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a72H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e9d232c-5c28-40a6-9bb4-8f9c2744f31e_1866x224.png 424w, https://substackcdn.com/image/fetch/$s_!a72H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e9d232c-5c28-40a6-9bb4-8f9c2744f31e_1866x224.png 848w, https://substackcdn.com/image/fetch/$s_!a72H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e9d232c-5c28-40a6-9bb4-8f9c2744f31e_1866x224.png 1272w, https://substackcdn.com/image/fetch/$s_!a72H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e9d232c-5c28-40a6-9bb4-8f9c2744f31e_1866x224.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!a72H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e9d232c-5c28-40a6-9bb4-8f9c2744f31e_1866x224.png" width="1456" height="175" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e9d232c-5c28-40a6-9bb4-8f9c2744f31e_1866x224.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:175,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58435,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!a72H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e9d232c-5c28-40a6-9bb4-8f9c2744f31e_1866x224.png 424w, https://substackcdn.com/image/fetch/$s_!a72H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e9d232c-5c28-40a6-9bb4-8f9c2744f31e_1866x224.png 848w, https://substackcdn.com/image/fetch/$s_!a72H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e9d232c-5c28-40a6-9bb4-8f9c2744f31e_1866x224.png 1272w, https://substackcdn.com/image/fetch/$s_!a72H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e9d232c-5c28-40a6-9bb4-8f9c2744f31e_1866x224.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><code>start_ingestion</code> and <code>end_ingestion</code> tasks are just dummy operators which are 
useful when dealing with multiple parallel tasks and when performing operations such as clearing and re-running downstream tasks.</p><p>The <code>import_snapshot_product</code> task queries and extracts the table data from the source database and ingests it into a new snapshot partition on the data lake. Once the data is imported, the previous snapshot partition is dropped.</p><p>A few pipeline development best practices to consider:</p><ul><li><p>To ensure that the pipeline is fully <em><a href="https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a">idempotent and deterministic</a></em>, it is better to bound the extraction query by a maximum <em>datetime</em>, such as the midnight timestamp, so that if the pipeline has to be re-executed it always extracts and loads the same data as the first run attempt.</p></li><li><p>Any existing partition that might have been created by a previous failed run has to be checked for and dropped before loading.</p></li></ul><p>In SQL terms, we would have a Jinja-templated SQL query similar to the following for data extraction, using Airflow built-in variables:</p><pre><code><code>SELECT * FROM products WHERE ts &lt; '{{ data_interval_end }}'
</code></code></pre><p>For the snapshot partition operation, if an SQL-based engine such as Hive or Trino (Presto) is used to manage the data on the data lake, we would have a templated SQL statement similar to the following:</p><pre><code><code>ALTER TABLE products
DROP IF EXISTS PARTITION (snapshot_date='{{ prev_start_date_success.to_date_string() }}')
</code></code></pre><p>Depending on the available tools, an appropriate execution engine can be used for data extraction and import, ranging from a simple Python JDBC connection, to distributed tools such as Trino or Sqoop on Hadoop-based platforms, to cloud-based tools such as AWS Glue when operating in the cloud.</p><p>The <code>truncate_product_incremental</code> Airflow task is simply responsible for truncating the <code>product_incremental</code> dataset.</p><p></p><h3>Incremental Pipeline</h3><p>The incremental ingestion pipeline in our example would be scheduled to run <em>hourly</em> on Airflow, and in its simplest form it would consist of the following DAG:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qLQQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322bce96-cce0-4301-96eb-9e3c0f25edbc_1748x306.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qLQQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322bce96-cce0-4301-96eb-9e3c0f25edbc_1748x306.png 424w, https://substackcdn.com/image/fetch/$s_!qLQQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322bce96-cce0-4301-96eb-9e3c0f25edbc_1748x306.png 848w, https://substackcdn.com/image/fetch/$s_!qLQQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322bce96-cce0-4301-96eb-9e3c0f25edbc_1748x306.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qLQQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322bce96-cce0-4301-96eb-9e3c0f25edbc_1748x306.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qLQQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322bce96-cce0-4301-96eb-9e3c0f25edbc_1748x306.png" width="1456" height="255" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/322bce96-cce0-4301-96eb-9e3c0f25edbc_1748x306.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:255,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42332,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qLQQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322bce96-cce0-4301-96eb-9e3c0f25edbc_1748x306.png 424w, https://substackcdn.com/image/fetch/$s_!qLQQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322bce96-cce0-4301-96eb-9e3c0f25edbc_1748x306.png 848w, https://substackcdn.com/image/fetch/$s_!qLQQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322bce96-cce0-4301-96eb-9e3c0f25edbc_1748x306.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qLQQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322bce96-cce0-4301-96eb-9e3c0f25edbc_1748x306.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Similar to the snapshot operation, to ensure that the pipeline is fully <em>deterministic</em> and <em>idempotent</em>, and therefore safe to retry, we need to ensure that any existing data or partitions created as a result of previous DAG runs are dropped when the <code>import_incremental_product</code> task is executed.</p><p></p><h2>Dynamic DAG Generation</h2><p>When we have many candidate source tables to import using this data ingestion pattern, it would be cumbersome, and bad software engineering practice, to create a separate DAG or set of tasks for each table with lots of code duplication.</p><p>We can take advantage of dynamic DAG generation in Airflow by keeping a JSON or YAML configuration file with the list of tables, and dynamically generating the required tasks in our DAG script. This pattern is referred to as <strong><a href="https://www.thoughtworks.com/en-in/radar/techniques/declarative-data-pipeline-definition">declarative or config-driven</a></strong> data pipeline development.</p><p>For demonstration purposes, let&#8217;s create the following JSON file listing four tables to be ingested by our Airflow snapshot and incremental workflows.</p><pre><code><code># lib/snapshot_sources.json
[
  {
    "schema": "public",
    "table": "product"
  },
  {
    "schema": "public",
    "table": "category"
  },
  {
    "schema": "public",
    "table": "clients"
  },
  {
    "schema": "public",
    "table": "suppliers"
  }
]
</code></code></pre><p>Using the above config file, in our <code>snapshot_ingestion</code> DAG script, the required tasks for each table can be generated as follows:</p><pre><code><code>import json


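from airflow.operators.python import PythonOperator

# dag, start_ingestion, end_ingestion and the _import_snapshot and
# _reload_tables callables are assumed to be defined earlier in this script.

# Load the list of source tables, closing the config file afterwards.
with open('/var/lib/airflow/dags/lib/snapshot_sources.json') as config_file:
  items = json.load(config_file)

for item in items:
  # One import task per table: loads a new snapshot partition.
  # provide_context is omitted: it is deprecated on Airflow 2, where the
  # context is passed to the callable automatically.
  import_snapshot = PythonOperator(
    task_id="import_snapshot_{}".format(item['table']),
    python_callable=_import_snapshot,
    op_kwargs={ 'table': item['table'], 'schema': item['schema'] },
    dag=dag
  )
  # One truncate task per table: clears the matching incremental dataset.
  truncate_incremental = PythonOperator(
    task_id="truncate_{}_incremental".format(item['table']),
    python_callable=_reload_tables,
    op_kwargs={ 'table': item['table'] },
    dag=dag
  )
  # Order: start_ingestion, import_snapshot, truncate_incremental, end_ingestion.
  import_snapshot.set_upstream(start_ingestion)
  import_snapshot.set_downstream(truncate_incremental)
  truncate_incremental.set_downstream(end_ingestion)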
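
# Optional sketch, not part of the original DAG script.
# Task ids above are derived from the table name only, so a duplicated or
# incomplete config entry would only surface when Airflow parses the DAG.
# A check like the hypothetical validate_sources() helper below (its name and
# rules are assumptions for illustration), defined before the loop and called
# on the loaded items, could fail earlier with a clearer message.
def validate_sources(items):
  """Return the table names in config order; raise ValueError on bad entries."""
  seen = set()
  tables = []
  for position, item in enumerate(items):
    # Every entry needs both keys used by the task generation loop.
    missing = {'schema', 'table'} - set(item)
    if missing:
      raise ValueError("entry {} is missing keys: {}".format(position, sorted(missing)))
    # Two entries with the same table name would produce the same task id.
    if item['table'] in seen:
      raise ValueError("duplicate table in config: {}".format(item['table']))
    seen.add(item['table'])
    tables.append(item['table'])
  return tables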
</code></code></pre><p>Now our new DAG is generated dynamically:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4H6e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac9e629-8c10-499d-b374-a68d8255e0a3_1282x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4H6e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac9e629-8c10-499d-b374-a68d8255e0a3_1282x596.png 424w, https://substackcdn.com/image/fetch/$s_!4H6e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac9e629-8c10-499d-b374-a68d8255e0a3_1282x596.png 848w, https://substackcdn.com/image/fetch/$s_!4H6e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac9e629-8c10-499d-b374-a68d8255e0a3_1282x596.png 1272w, https://substackcdn.com/image/fetch/$s_!4H6e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac9e629-8c10-499d-b374-a68d8255e0a3_1282x596.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4H6e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac9e629-8c10-499d-b374-a68d8255e0a3_1282x596.png" width="1282" height="596" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ac9e629-8c10-499d-b374-a68d8255e0a3_1282x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:596,&quot;width&quot;:1282,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90557,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4H6e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac9e629-8c10-499d-b374-a68d8255e0a3_1282x596.png 424w, https://substackcdn.com/image/fetch/$s_!4H6e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac9e629-8c10-499d-b374-a68d8255e0a3_1282x596.png 848w, https://substackcdn.com/image/fetch/$s_!4H6e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac9e629-8c10-499d-b374-a68d8255e0a3_1282x596.png 1272w, https://substackcdn.com/image/fetch/$s_!4H6e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ac9e629-8c10-499d-b374-a68d8255e0a3_1282x596.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>This is also where dummy tasks such as <code>start_ingestion</code> come in handy, in case we need to clear the state of all <code>import_snapshot</code> tasks with one click.</p><p></p><h2>Conclusion</h2><p>In this article, we have discussed a batch data ingestion pattern that combines an incremental pipeline with a snapshot ingestion pipeline. 
This pattern can be quite useful in certain use cases where you need 100% data completeness and consistency with the source, such as when dealing with financial data, where the incremental pipeline alone may not provide the required level of consistency and reliability.</p>]]></content:encoded></item><item><title><![CDATA[Common Techniques For Periodically Extracting Data From Relational Databases]]></title><description><![CDATA[Presenting different techniques for extracting data from relational database systems when building ETL pipelines for a central data lake, data warehouse or data lakehouse.]]></description><link>https://www.pracdata.io/p/common-techniques-for-extracting</link><guid isPermaLink="false">https://www.pracdata.io/p/common-techniques-for-extracting</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Mon, 18 Sep 2023 10:34:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ef1a45-39a0-4d89-a1a1-121753394ad1_2113x681.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>1. Introduction</h2><p>Extracting data from relational databases such as those used in Business Support Systems (BSS) is a common ETL and data extraction pattern in analytical data platforms. It&#8217;s the job of data engineers to choose the best approach to build RDBMS data ingestion pipelines, whether integrating data to a central data lake, data warehouse or data lakehouse.</p><p>There are many factors to consider when deciding how to extract data from the source system in the most efficient and optimal way. I prefer to distinguish how data is <em>extracted</em> from the source from how data is <em>loaded</em> into the target platform, as I see them as two separate problems to solve. 
This article covers common data extraction patterns, focusing on how the data is <em><strong>extracted</strong></em> from a relational database, usually in batch or micro-batch mode. The logic and patterns for loading and reconciling the data in the target data repository, once it is extracted from the source, are not fully addressed by the presented techniques.</p><p>The decision of which extraction pattern to use for each dataset/table is largely determined by the source table profile and its properties.</p><p><strong>Main Criteria</strong></p><ul><li><p>Type of the dataset (fact, dimension, immutable log, etc.)</p></li><li><p>Size of the table (volume, depth and width)</p></li><li><p>Availability of metadata or control columns such as an incremental unique ID, created timestamp and modified timestamp</p></li><li><p>The rate of inserts and updates (how fast the data changes)</p><p></p></li></ul><p><strong>Other criteria which should be considered as well:</strong></p><ul><li><p>Data latency and data freshness requirements of the business</p></li><li><p>Availability and cost of storage and compute resources</p></li><li><p>Source database engine resource availability</p></li></ul><p></p><h2>2. Techniques for detecting changed data</h2><p>When dealing with immutable insert-only data (facts in data warehouse terms) such as transactional records which are not modified after insert, the logic for extracting data is simpler as we don&#8217;t have to worry about updates and deletes, and we only need to detect new inserts. 
However, when dealing with mutable data (dimensions in data warehouse terms) in relational databases, in addition to detecting new inserts, we also need to detect changed records such as those updated or deleted.</p><p>There are different techniques for detecting changed data in relational databases:</p><p></p><h3>(1) Query-based CDC - Using Audit Columns</h3><div><hr></div><ul><li><p>Identify changed data using SQL queries on the source engine by filtering based on an audit column.</p></li></ul><ul><li><p>These columns are usually auto-populated either by the front-end systems or by database triggers. Examples are incremental IDs, <em>create_date</em> and <em>modified_timestamp</em> fields. The integrity of these columns is important and must be verified.</p></li><li><p>If the columns are updated by the front-end system and not by database triggers, there is a risk of records being manually changed by back-end developers or DBAs without the audit columns being updated.</p></li><li><p>The existence of NULL values can also raise concerns about the integrity of these columns.</p></li><li><p>Once the integrity of the audit column is verified, one of the most common ways to use the modified column is to extract all records whose modified timestamp is greater than the maximum timestamp of the previous load, in an incremental fashion, which will be discussed in the next section. 
However, we must always approach such audit columns with caution and never assume they are consistent and reliable until proven.</p></li><li><p>For immutable insert-only tables, the timestamp column or an incremental <em>row id</em> identifier is good enough to be used as the "<em><strong>maximum value column</strong></em>" to extract new records on each incremental load.</p><p></p></li><li><p><strong>Pros</strong></p><ul><li><p>Straightforward and flexible, using already familiar SQL semantics</p></li><li><p>No additional configuration or dependencies are needed on the source system</p></li></ul></li><li><p><strong>Cons</strong></p><ul><li><p>This technique doesn't capture physical DELETEs; only INSERTs and UPDATEs are detected</p></li><li><p>Extra load on the database engine, which can cause performance issues for the front-end systems accessing the engine.</p></li><li><p>It is possible that not all changes are captured (e.g. a record manually updated at the database level without its timestamp being updated)</p></li></ul></li></ul><p></p><h3>(2) Trigger-based CDC - Using Change-log Table/Files</h3><div><hr></div><ul><li><p>If the audit columns are not available, it is possible to develop new logic on the database back-end to capture inserts and updates by means of database triggers for a set of required tables, and capture those records in a change-log table or file.</p><p></p></li><li><p><strong>Pros</strong></p><ul><li><p>All changes can be captured (INSERT, DELETE, UPDATE).</p></li></ul></li><li><p><strong>Cons</strong></p><ul><li><p>Triggers have to be defined for all required tables. 
This introduces additional processing and storage cost.</p></li><li><p>The <em>change-log</em> table has to be queried to extract changes, which puts load on the source engine.</p></li></ul></li></ul><p></p><h3>(3) Log-based Change-Data-Capture (CDC)</h3><div><hr></div><ul><li><p>Using database redo logs or third-party <em>change data capture</em> libraries and tools, it is possible to identify the new and changed records for the tables of interest. This method is usually used in event-driven and streaming workflow patterns.</p><p></p></li><li><p><strong>Pros</strong></p><ul><li><p>Captures all the changes (INSERT, DELETE, UPDATE)</p></li><li><p>Little additional query load on the source database engine, because changes are read from the log instead of being queried from the tables</p></li><li><p>Decouples data ingestion pipelines from the source database systems</p></li></ul></li><li><p><strong>Cons</strong></p><ul><li><p>More complex to implement and reason about</p></li><li><p>Not all systems support log-based CDC, and some require an additional license</p></li><li><p>Usually an additional third-party CDC tool (e.g. Debezium) and <em>Message Broker (e.g. Kafka)</em> need to be included in the data framework, which adds complexity, maintenance and monitoring to the platform.</p></li></ul></li></ul><p></p><h2>3. Data Extraction Patterns</h2><p>Once we have decided how to detect changes in the upstream database system, we can look at the possible extraction patterns for building the required ETL pipelines.</p><p></p><h3>(1) Periodic Full Reload</h3><div><hr></div><p>(Also referred to as <strong>truncate/reload</strong> in data warehousing terms)</p><p>This technique is as simple as taking a periodic full snapshot of the source table and replacing the target dataset. This is one of the simplest techniques for extracting mutable data, and it is not even concerned with how to detect changes in the source table. 
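</p><p>As a rough illustration, a truncate/reload can be sketched in a few lines of Python using the standard-library <em>sqlite3</em> module. The <em>customer</em> table, its columns and the two in-memory databases are hypothetical stand-ins for a real source and target; this is a sketch under those assumptions, not a production loader.</p><pre><code>
```python
import sqlite3

def full_reload(source, target, table):
    """Replace the whole target table with a fresh snapshot of the source table."""
    rows = source.execute(f"SELECT id, name FROM {table}").fetchall()
    with target:  # delete and inserts are committed together as one transaction
        target.execute(f"DELETE FROM {table}")  # SQLite has no TRUNCATE statement
        target.executemany(f"INSERT INTO {table} (id, name) VALUES (?, ?)", rows)
    return len(rows)

# Hypothetical in-memory source and target holding a small dimension table.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
# The target still holds an older state: customer 2 was renamed, customer 3 was deleted.
target.executemany("INSERT INTO customer VALUES (?, ?)", [(2, "Bobby"), (3, "Cy")])
target.commit()

loaded = full_reload(source, target, "customer")
print(loaded, target.execute("SELECT id, name FROM customer ORDER BY id").fetchall())
```
</code></pre><p>In this sketch the updated row and the deleted row are both corrected on the target without any change detection, simply because the whole table is replaced.</p><p>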
However, it is only suitable in specific circumstances, such as small, slowly changing dimensional data that is not updated frequently.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!akHp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7e1a121-d648-4b76-8b06-62f734b8a2db_1373x438.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!akHp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7e1a121-d648-4b76-8b06-62f734b8a2db_1373x438.png 424w, https://substackcdn.com/image/fetch/$s_!akHp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7e1a121-d648-4b76-8b06-62f734b8a2db_1373x438.png 848w, https://substackcdn.com/image/fetch/$s_!akHp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7e1a121-d648-4b76-8b06-62f734b8a2db_1373x438.png 1272w, https://substackcdn.com/image/fetch/$s_!akHp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7e1a121-d648-4b76-8b06-62f734b8a2db_1373x438.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!akHp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7e1a121-d648-4b76-8b06-62f734b8a2db_1373x438.png" width="1373" height="438" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7e1a121-d648-4b76-8b06-62f734b8a2db_1373x438.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:438,&quot;width&quot;:1373,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!akHp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7e1a121-d648-4b76-8b06-62f734b8a2db_1373x438.png 424w, https://substackcdn.com/image/fetch/$s_!akHp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7e1a121-d648-4b76-8b06-62f734b8a2db_1373x438.png 848w, https://substackcdn.com/image/fetch/$s_!akHp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7e1a121-d648-4b76-8b06-62f734b8a2db_1373x438.png 1272w, https://substackcdn.com/image/fetch/$s_!akHp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7e1a121-d648-4b76-8b06-62f734b8a2db_1373x438.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ul><li><p><strong>Pros</strong>:</p><ul><li><p>Simple and easy to implement</p></li><li><p>Effective for small tables and slowly changing dimensions</p></li><li><p>Deletes are captured automatically</p></li><li><p>Schema evolution is automatically supported</p></li></ul></li><li><p><strong>Cons</strong>:</p><ul><li><p>Scalability: load performance degrades as the source table grows larger</p></li><li><p>Long processing times and heavy resource usage of the ETL job for large tables</p></li><li><p>History of data changes is not preserved</p></li></ul><p></p></li></ul><p>Note that this method doesn&#8217;t preserve historical changes of the source table, as it reloads the data every time the pipeline is executed.</p><p></p><h3>(2) Periodic Partial Reload</h3><div><hr></div><p>To avoid performing a full truncate-reload that snapshots old, historical data on every pipeline execution, for larger tables we can reduce the snapshot to only recent data, such as the last 6 or 12 months. 
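</p><p>One possible way to implement such a partial reload is to delete the recent window from the target and re-insert it from the source. The sketch below uses Python with the standard-library <em>sqlite3</em> module; the <em>orders</em> table, the <em>order_date</em> filter column and the cutoff date are hypothetical names chosen for illustration only.</p><pre><code>
```python
import sqlite3

def partial_reload(source, target, cutoff):
    """Reload only the rows whose order_date is on or after the cutoff date."""
    recent = source.execute(
        "SELECT id, order_date, amount FROM orders WHERE order_date >= ?", (cutoff,)
    ).fetchall()
    with target:  # swap the recent window inside a single transaction
        target.execute("DELETE FROM orders WHERE order_date >= ?", (cutoff,))
        target.executemany("INSERT INTO orders VALUES (?, ?, ?)", recent)
    return len(recent)

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)")
# The target holds an earlier load; since then the source changed order 2,
# deleted order 3 and received the new order 4.
target.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "2023-01-10", 10.0), (2, "2024-11-02", 20.0), (3, "2024-12-01", 30.0)])
target.commit()
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "2023-01-10", 10.0), (2, "2024-11-02", 25.0), (4, "2025-01-05", 40.0)])

reloaded = partial_reload(source, target, "2024-06-01")
result = target.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
print(reloaded, result)
```
</code></pre><p>Only the two recent rows are read from the source, the old row outside the window is left untouched, and the delete of order 3 is picked up because it falls inside the reloaded window.</p><p>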
This extraction pattern relies on the availability of a timestamp attribute that can be used to filter recent records. A one-time job is still needed to extract all the historical data at the beginning; only thereafter is the partial snapshot extraction scheduled.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UlUr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11a0e2-0799-4a4a-a573-4bc6cfdff517_1391x438.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UlUr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11a0e2-0799-4a4a-a573-4bc6cfdff517_1391x438.png 424w, https://substackcdn.com/image/fetch/$s_!UlUr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11a0e2-0799-4a4a-a573-4bc6cfdff517_1391x438.png 848w, https://substackcdn.com/image/fetch/$s_!UlUr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11a0e2-0799-4a4a-a573-4bc6cfdff517_1391x438.png 1272w, https://substackcdn.com/image/fetch/$s_!UlUr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11a0e2-0799-4a4a-a573-4bc6cfdff517_1391x438.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UlUr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11a0e2-0799-4a4a-a573-4bc6cfdff517_1391x438.png" width="1391" height="438" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f11a0e2-0799-4a4a-a573-4bc6cfdff517_1391x438.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:438,&quot;width&quot;:1391,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UlUr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11a0e2-0799-4a4a-a573-4bc6cfdff517_1391x438.png 424w, https://substackcdn.com/image/fetch/$s_!UlUr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11a0e2-0799-4a4a-a573-4bc6cfdff517_1391x438.png 848w, https://substackcdn.com/image/fetch/$s_!UlUr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11a0e2-0799-4a4a-a573-4bc6cfdff517_1391x438.png 1272w, https://substackcdn.com/image/fetch/$s_!UlUr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11a0e2-0799-4a4a-a573-4bc6cfdff517_1391x438.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ul><li><p><strong>Pros</strong>:</p><ul><li><p>Avoids repeatedly extracting large volumes of unmodified historical data</p></li><li><p>Effective for small to medium tables with a large volume of historical records</p></li></ul></li><li><p><strong>Cons</strong>:</p><ul><li><p>Adds ingestion complexity for de-duplication and reconciliation compared to the full snapshot method</p></li><li><p>Relies on a "modified timestamp" column always being correct when a record is added or modified on the source</p></li><li><p>Deletes are only captured for the snapshot period covered</p></li><li><p>Schema validation might be required</p></li></ul></li></ul><p>You might ask why one would choose this approach instead of simply implementing an incremental pipeline that extracts only the latest changed data (which is presented as another technique in this article). 
To answer, there are scenarios where extracting data for a fixed period is more suitable and reliable. For example, when a large table is partitioned only by year or month, using the monthly partition as the predicate can give a faster, more efficient ETL operation than filtering on a non-partitioned timestamp when reading from the source engine.</p><p>Additionally, when the source system is not fully trusted, and system admins or DBAs might manually change recent data in the back-end when corruption or data quality issues occur, one might decide to extract data over a longer period as protection against such incidents.</p><p></p><h3>(3) Periodic Snapshots</h3><div><hr></div><p>This technique is similar to <strong>Technique (1)</strong>, with the difference that instead of truncating and reloading the target dataset, each full snapshot extracted from the source system is inserted into a new versioned partition on the target data repository.</p><p>This method is simple and, compared to the "Periodic Full Reload" method, has the additional advantage of preserving history. However, it comes at the cost of large data duplication and redundancy. 
Therefore, it is not suitable for large, fast-changing datasets or dimensions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lYsc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c83b19-14a6-4f9f-96f9-4d33e3c132b2_1491x438.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lYsc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c83b19-14a6-4f9f-96f9-4d33e3c132b2_1491x438.png 424w, https://substackcdn.com/image/fetch/$s_!lYsc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c83b19-14a6-4f9f-96f9-4d33e3c132b2_1491x438.png 848w, https://substackcdn.com/image/fetch/$s_!lYsc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c83b19-14a6-4f9f-96f9-4d33e3c132b2_1491x438.png 1272w, https://substackcdn.com/image/fetch/$s_!lYsc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c83b19-14a6-4f9f-96f9-4d33e3c132b2_1491x438.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lYsc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c83b19-14a6-4f9f-96f9-4d33e3c132b2_1491x438.png" width="1456" height="428" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7c83b19-14a6-4f9f-96f9-4d33e3c132b2_1491x438.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:428,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lYsc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c83b19-14a6-4f9f-96f9-4d33e3c132b2_1491x438.png 424w, https://substackcdn.com/image/fetch/$s_!lYsc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c83b19-14a6-4f9f-96f9-4d33e3c132b2_1491x438.png 848w, https://substackcdn.com/image/fetch/$s_!lYsc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c83b19-14a6-4f9f-96f9-4d33e3c132b2_1491x438.png 1272w, https://substackcdn.com/image/fetch/$s_!lYsc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c83b19-14a6-4f9f-96f9-4d33e3c132b2_1491x438.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ul><li><p><strong>Pros</strong>:</p><ul><li><p>Simple and easy to implement</p></li><li><p>Effective for small tables and slowly changing dimensions</p></li><li><p>Deletes are captured automatically</p></li><li><p>Historical data changes are preserved</p></li><li><p>Good performance on read and write</p></li></ul></li><li><p><strong>Cons</strong>:</p><ul><li><p>Data duplication and storage inefficiency are the main downside</p></li><li><p>Not suitable for fast-changing dimensions</p></li><li><p>Dimension needs to be created from <strong>one process only</strong></p></li><li><p>Requirements to refresh dimensions on a more frequent cadence than daily will also amplify duplication</p></li></ul><p></p></li></ul><p>This technique usually requires either computing the latest snapshot (the current state) at query time, or relying on a separate pipeline to build the current state in a separate dataset and therefore maintaining two datasets: one containing all the history with duplicate data, and one containing only the current state.</p><p>Due to maintaining large 
amounts of duplicate data, data retention policies should usually be in place to periodically discard older data and reclaim space.</p><p></p><h3>(4) Incremental Inserts</h3><div><hr></div><p>Provided that the source table is immutable, meaning that it is an insert-only table with no updates or modifications, and there is an incremental row id/primary-key or insert timestamp, the new records can be ingested into the central data platform incrementally using the <strong>Maximum Value Column</strong>.</p><p>This is one of the most common approaches for incremental ingestion of transactional data. It&#8217;s a straightforward technique, and since we don&#8217;t have to worry about updates, we only need to keep track of the last record extracted from the source.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fdko!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62e86624-f940-4ac4-bc43-7d36c0fbd611_1391x438.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fdko!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62e86624-f940-4ac4-bc43-7d36c0fbd611_1391x438.png 424w, https://substackcdn.com/image/fetch/$s_!Fdko!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62e86624-f940-4ac4-bc43-7d36c0fbd611_1391x438.png 848w, https://substackcdn.com/image/fetch/$s_!Fdko!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62e86624-f940-4ac4-bc43-7d36c0fbd611_1391x438.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Fdko!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62e86624-f940-4ac4-bc43-7d36c0fbd611_1391x438.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fdko!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62e86624-f940-4ac4-bc43-7d36c0fbd611_1391x438.png" width="1391" height="438" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62e86624-f940-4ac4-bc43-7d36c0fbd611_1391x438.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:438,&quot;width&quot;:1391,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fdko!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62e86624-f940-4ac4-bc43-7d36c0fbd611_1391x438.png 424w, https://substackcdn.com/image/fetch/$s_!Fdko!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62e86624-f940-4ac4-bc43-7d36c0fbd611_1391x438.png 848w, https://substackcdn.com/image/fetch/$s_!Fdko!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62e86624-f940-4ac4-bc43-7d36c0fbd611_1391x438.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Fdko!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62e86624-f940-4ac4-bc43-7d36c0fbd611_1391x438.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ul><li><p><strong>Pros</strong>:</p><ul><li><p>Reduces ingestion workload as only new records are extracted from the source</p></li><li><p>Suitable for large immutable log-based datasets</p></li></ul></li><li><p><strong>Cons</strong>:</p><ul><li><p>It relies on the source system's Maximum Value Column, such as an incremental row id/primary-key or 
create-date timestamp, to always be populated correctly</p></li><li><p>Deletes are not captured, which matters if the target systems and downstream data consumers need them</p></li></ul></li></ul><p></p><h3>(5) Incremental Upserts</h3><div><hr></div><p>Incremental inserts work when we are dealing with immutable insert-only data such as transactional data. However, for mutable data, provided there is a modified timestamp column on the source table, we can apply the same technique to that "<em>audit column</em>" to identify and ingest only the new and changed records into the data lake on an incremental basis.</p><p>The only difference is that an upsert, instead of an insert, has to be performed on the target dataset to eliminate duplicates. There are different upsert and de-duplication techniques, such as using a SQL upsert statement, which won&#8217;t be covered in this article.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cKkb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216f1250-a5ce-49db-8bfe-bd6ade31c1f4_1482x441.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cKkb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216f1250-a5ce-49db-8bfe-bd6ade31c1f4_1482x441.png 424w, https://substackcdn.com/image/fetch/$s_!cKkb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216f1250-a5ce-49db-8bfe-bd6ade31c1f4_1482x441.png 848w, https://substackcdn.com/image/fetch/$s_!cKkb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216f1250-a5ce-49db-8bfe-bd6ade31c1f4_1482x441.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cKkb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216f1250-a5ce-49db-8bfe-bd6ade31c1f4_1482x441.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cKkb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216f1250-a5ce-49db-8bfe-bd6ade31c1f4_1482x441.png" width="1456" height="433" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/216f1250-a5ce-49db-8bfe-bd6ade31c1f4_1482x441.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:433,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cKkb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216f1250-a5ce-49db-8bfe-bd6ade31c1f4_1482x441.png 424w, https://substackcdn.com/image/fetch/$s_!cKkb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216f1250-a5ce-49db-8bfe-bd6ade31c1f4_1482x441.png 848w, https://substackcdn.com/image/fetch/$s_!cKkb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216f1250-a5ce-49db-8bfe-bd6ade31c1f4_1482x441.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cKkb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F216f1250-a5ce-49db-8bfe-bd6ade31c1f4_1482x441.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ul><li><p><strong>Pros</strong>:</p><ul><li><p>Reduces ingestion workload as only new and changed records are extracted from the source</p></li><li><p>Suitable for large and fast changing tables</p></li></ul></li><li><p><strong>Cons</strong>:</p><ul><li><p>Extra logic has to be added to the pipeline for de-duplication and replacement of the changed records in the 
target dataset. Modern table formats such as Apache Hudi or Apache Iceberg can be used to take care of this logic under the hood.</p></li><li><p>It relies on the assumption that the modified timestamp column is always updated when a record is changed</p></li><li><p>If deletes are required, they are not captured automatically</p></li></ul></li></ul><p></p><h3>(6) Hybrid - Incremental Insert/Upsert + Periodic Reload</h3><div><hr></div><p>In a scenario where an incremental insert or upsert pattern is required but the audit or <em>maximum value</em> column cannot be fully trusted to always be updated automatically, a hybrid pattern can be employed: an incremental job is scheduled every x hours to ingest new/updated records from the source, and another periodic job with a lower frequency, such as daily or weekly, is scheduled to ingest a full or partial snapshot of the source data to rectify any records missed or left outdated by the incremental job.</p><p>Of course, the business has to accept the risk that, until the next scheduled full snapshot is ingested, the incrementally ingested data might not be 100% accurate. It can still be a viable option in environments where strict governance over source data is not exercised and manual modification of records is possible, which presents the risk of missing the most recent version of modified data on the target system.</p><p>The following figure shows an example of performing an hourly incremental load while taking a full snapshot that truncates and reloads the target dataset on a daily basis.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tR8t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ef1a45-39a0-4d89-a1a1-121753394ad1_2113x681.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tR8t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ef1a45-39a0-4d89-a1a1-121753394ad1_2113x681.png 424w, https://substackcdn.com/image/fetch/$s_!tR8t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ef1a45-39a0-4d89-a1a1-121753394ad1_2113x681.png 848w, https://substackcdn.com/image/fetch/$s_!tR8t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ef1a45-39a0-4d89-a1a1-121753394ad1_2113x681.png 1272w, https://substackcdn.com/image/fetch/$s_!tR8t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ef1a45-39a0-4d89-a1a1-121753394ad1_2113x681.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tR8t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ef1a45-39a0-4d89-a1a1-121753394ad1_2113x681.png" width="1456" height="469" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2ef1a45-39a0-4d89-a1a1-121753394ad1_2113x681.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:469,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!tR8t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ef1a45-39a0-4d89-a1a1-121753394ad1_2113x681.png 424w, https://substackcdn.com/image/fetch/$s_!tR8t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ef1a45-39a0-4d89-a1a1-121753394ad1_2113x681.png 848w, https://substackcdn.com/image/fetch/$s_!tR8t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ef1a45-39a0-4d89-a1a1-121753394ad1_2113x681.png 1272w, https://substackcdn.com/image/fetch/$s_!tR8t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ef1a45-39a0-4d89-a1a1-121753394ad1_2113x681.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ul><li><p><strong>Pros</strong>:</p><ul><li><p>Reduces the workload by avoiding full snapshot ingestion of large source tables within short schedule periods</p></li><li><p>Mitigates the data integrity risks of incremental insert or upsert mentioned above</p></li></ul></li><li><p><strong>Cons</strong>:</p><ul><li><p>More complex to implement due to having to manage two distinct pipelines and the dependencies between them</p></li></ul></li></ul><p></p><h2>4. Conclusion</h2><p>The presented techniques are primarily used when batch and micro-batch data ingestion patterns are needed. That said, there are other mechanisms, such as Change Data Capture (CDC), for ingesting data from relational databases into the central data platform in more real-time, event-based architectures. However, even in event-based architectures there may still be scenarios where the source system doesn&#8217;t support CDC or it&#8217;s not a viable option, and engineers therefore need to look at alternative techniques for extracting and loading data from such source systems into the central data repository.</p>]]></content:encoded></item></channel></rss>