<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Practical Data Engineering: Data Landscape Digests 🗞️]]></title><description><![CDATA[A round-up of curated articles, trends and news in Data Engineering Landscape]]></description><link>https://www.pracdata.io/s/data-landscape-digests</link><image><url>https://substackcdn.com/image/fetch/$s_!SGaR!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46497e14-ec41-42a0-9067-72715fc9c842_848x848.png</url><title>Practical Data Engineering: Data Landscape Digests 🗞️</title><link>https://www.pracdata.io/s/data-landscape-digests</link></image><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 11:28:30 GMT</lastBuildDate><atom:link href="https://www.pracdata.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Alireza Sadeghi]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[practicaldataengineering@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[practicaldataengineering@substack.com]]></itunes:email><itunes:name><![CDATA[Alireza Sadeghi]]></itunes:name></itunes:owner><itunes:author><![CDATA[Alireza Sadeghi]]></itunes:author><googleplay:owner><![CDATA[practicaldataengineering@substack.com]]></googleplay:owner><googleplay:email><![CDATA[practicaldataengineering@substack.com]]></googleplay:email><googleplay:author><![CDATA[Alireza Sadeghi]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[DLD #5 | Data Landscape Digest 🗞️]]></title><description><![CDATA[S3 Table Abstraction, Airflow's date intervals, Confluent TableFlow, DuckDB Lakehouse analytics, Google Next 
Keynotes and more.]]></description><link>https://www.pracdata.io/p/dld-5-data-landscape-digest</link><guid isPermaLink="false">https://www.pracdata.io/p/dld-5-data-landscape-digest</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Mon, 21 Apr 2025 11:48:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elGK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!elGK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!elGK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!elGK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!elGK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!elGK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!elGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg" width="1456" height="1020" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182213,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!elGK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!elGK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!elGK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!elGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div><hr></div><h1>&#10024; Featured - Table Buckets as a New Lakehouse Abstraction</h1><div><hr></div><p>Since AWS announced <strong>S3 Tables</strong> last December, many industry experts have shared their opinions through blogs and social media. Most of the feedback I've read so far has been somewhat negative. Critics like <strong><a href="https://dataengineeringcentral.substack.com/p/aws-s3-tables-the-iceberg-cometh">Daniel</a></strong> argue that S3 Tables create vendor lock-in. 
They see it as a proprietary solution that limits flexibility and openness required by true Open Lakehouse architectures.</p><p>Critics also highlight that S3 Tables mainly integrate with AWS services such as <strong>Glue</strong>, <strong>Athena</strong>, and <strong>EMR</strong>, lacking sufficient support for general-purpose tools and services.</p><p>Despite these criticisms, broader support for S3 Tables is emerging. <strong>PyIceberg</strong> now supports accessing S3 Tables through Glue Catalog, and <a href="https://aws.amazon.com/blogs/storage/streamlining-access-to-tabular-datasets-stored-in-amazon-s3-tables-with-duckdb/">preview support</a> for querying S3 Tables using Iceberg REST API endpoints, such as Amazon SageMaker Lakehouse REST endpoints, has been added to <strong>DuckDB</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lfmz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lfmz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 424w, https://substackcdn.com/image/fetch/$s_!lfmz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 848w, https://substackcdn.com/image/fetch/$s_!lfmz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lfmz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lfmz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png" width="1108" height="324" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:324,&quot;width&quot;:1108,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42453,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.pracdata.io/i/161791484?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lfmz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 424w, https://substackcdn.com/image/fetch/$s_!lfmz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 848w, https://substackcdn.com/image/fetch/$s_!lfmz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 
1272w, https://substackcdn.com/image/fetch/$s_!lfmz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24b5504d-d0e3-40d0-a195-dc7fa1a2f9e4_1108x324.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In this context, <strong>Werner Vogels</strong>, Amazon's CTO, recently published another excellent article on his All Things Distributed blog. He reflects on the 19-year evolution of Amazon S3, from simple object storage to a comprehensive data solution. 
Vogels emphasises how customer feedback has shaped the development of S3's major capabilities.</p><p>Vogels further explains that S3's latest innovation, S3 Tables, directly addresses customer feedback and the pain points customers experience when working with open table formats on S3 object storage. Its main goal is to enhance lakehouse data management by resolving these challenges. Vogels also discusses the <strong>ongoing tension between simplicity and velocity</strong>, a common trade-off in product development. I highly recommend reading his article. </p><p><strong><a href="https://www.allthingsdistributed.com/2025/03/in-s3-simplicity-is-table-stakes.html">Read the full article here</a></strong></p><div><hr></div><h1>&#128225; Open Source News</h1><div><hr></div><h3>&#128073; Apache Flink 2.0.0 Release</h3><p><strong>Apache Flink 2.0.0</strong> has been officially released, marking the first major Flink release since version 1.0 launched nine years ago. Key innovations include disaggregated state management, materialised tables for unified stream-batch processing, and enhanced SQL capabilities. Check out the full highlights of new features and improvements. <strong><a href="https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/">--&gt; Read More</a></strong></p><h3>&#128073; Apache Kafka 4.0 Release</h3><p><strong>Apache Kafka 4.0</strong> is officially out, marking a significant milestone by removing the dependency on Apache ZooKeeper. Key features highlighted by the Confluent blog include a new consumer group protocol for improved rebalance performance, the introduction of Queues for Kafka to support traditional queue semantics, updated Java version requirements, and various enhancements through Kafka Improvement Proposals (KIPs). 
<strong><a href="https://www.confluent.io/blog/latest-apache-kafka-release/">--&gt; Read More</a></strong></p><h3>&#128073; DuckDB Now Has Its Own Web UI!</h3><p>Starting with <strong>DuckDB v1.2.1</strong>, DuckDB ships with a local web user interface (UI), developed in collaboration with <strong>MotherDuck</strong> to simplify database interactions. With the new UI, you can execute SQL queries in interactive notebooks and explore databases and columns, with features such as syntax highlighting and autocomplete. <strong><a href="https://delta.io/blog/delta-lake-optimize/">--&gt; Read More</a></strong></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.pracdata.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Practical Data Engineering! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><div><hr></div><h1> &#128736; Practical Data Engineering</h1><div><hr></div><h3>&#128073;  Implementation of Medallion Architecture Using ClickHouse</h3><p>A great hands-on example of implementing a <strong>medallion architecture</strong> using <strong>ClickHouse</strong> to process and analyse freely available data from the Bluesky social network. It highlights common challenges such as data duplication, malformed JSON, and inconsistent structures. 
It also demonstrates how ClickHouse features, including the JSON data type and the ReplacingMergeTree engine, effectively address these issues. <strong><a href="https://clickhouse.com/blog/building-a-medallion-architecture-for-bluesky-json-data-with-clickhouse">--&gt; Read More</a></strong></p><h3>&#128073; DuckDB&#8217;s Enhanced Lakehouse Analytics</h3><p>Querying external Delta Lake tables from DuckDB has become significantly easier with recent improvements in the Delta extension. You can now attach Delta tables and query them through aliases, which makes queries simpler and cleaner. Additionally, data-skipping enhancements accelerate federated queries over open table formats, demonstrating DuckDB's commitment to advancing federated analytics over lakehouse table formats. <strong><a href="https://duckdb.org/2025/03/21/maximizing-your-delta-scan-performance.html">--&gt; Read More</a></strong></p><h3>&#128073; 21 Reasons to Consider Apache Hudi!</h3><p>While <strong>Apache Iceberg</strong> has emerged as the leading open table format and remains strong in 2025, <strong>Apache Hudi</strong> is making its case by highlighting 21 unique reasons to consider Hudi over Iceberg and Delta Lake. If you're currently evaluating open table formats for your next project or company, this comparison is worth checking out. <strong><a href="https://hudi.apache.org/blog/2025/03/05/hudi-21-unique-differentiators/">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#9881;&#65039; Technical Deep Dive</h1><div><hr></div><h3>&#128073; Parquet Data Skipping Mechanisms</h3><p>This concise article provides an excellent overview of how pruning and data skipping are typically performed on <strong>Parquet</strong> files using metadata and statistics at various levels, including row groups and pages. 
The described approach aligns with <strong>DataFusion</strong>'s implementation of the Parquet reading and pruning pipeline, but the general design principles apply broadly to other Parquet readers as well. <strong><a href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/">--&gt; Read More</a></strong></p><h3>&#128073; What's the Difference Between Arrow Flight, ADBC, and Arrow IPC? </h3><p>Still confused about the differences between Arrow Flight Protocol, ADBC, and Arrow IPC? This article clearly explains these key technologies within the <strong>Apache Arrow</strong> ecosystem. It also highlights the advantages of using Apache Arrow as a data interchange format, demonstrating how it enables faster and more efficient data exchange compared to traditional methods. <strong><a href="https://arrow.apache.org/blog/2025/02/28/data-wants-to-be-free/">--&gt; Read More</a></strong></p><h3>&#128073; Airflow Data Intervals: A Deep Dive</h3><p>Understanding <strong>Apache Airflow</strong>'s concept of time and intervals&#8212;such as start time, execution time, and logical dates&#8212;can be challenging. This article dives deep into the significance of data intervals in Airflow, clearly explaining their critical role in effectively scheduling and executing workflows. It illustrates how data intervals ensure each DAG run processes complete and accurate datasets, thus promoting idempotency and enabling reliable backfilling of historical data. <strong><a href="https://towardsdatascience.com/airflow-data-intervals-a-deep-dive-15d0ccfb0661/">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#128195; Academic Papers</h1><div><hr></div><h3>&#128073; OLAP DBMS Archetype For The Next Ten Years?  </h3><p>An insightful paper co-authored by Michael Stonebraker and Andy Pavlo provides a comprehensive overview of the evolution of database management systems (DBMSs) since 2005. 
The authors predict that the <strong>lakehouse architecture</strong> will be the "<strong>OLAP DBMS archetype</strong>" for the coming decade, offering a unified infrastructure capable of supporting both SQL and non-SQL workloads, a vision initially conceived but not fully realised during the Hadoop era. <strong><a href="https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1> &#128188; Data Engineering Career</h1><div><hr></div><h3>&#128073; How Will AI Disrupt Data Engineering?  </h3><p>This is one of the most compelling questions facing software and data engineers today. <strong>Tristan Handy</strong>, founder and CEO of dbt Labs, shares insightful perspectives on how artificial intelligence will impact data engineering roles and what the future might hold for data engineers. <strong><a href="https://roundup.getdbt.com/p/how-ai-will-disrupt-data-engineering">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#128170; Skill Up</h1><div><hr></div><h3>&#128073; Learning ClickHouse Fundamentals</h3><p>ClickHouse is offering free 3-hour training sessions on ClickHouse fundamentals, scheduled for April 22 and May 15. If you're interested, be sure to sign up! <strong><a href="https://clickhouse.com/company/events/clickhouse-fundamentals">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1> &#128227; Vendors News &amp; Announcements</h1><div><hr></div><h3>&#128073; General Availability of Confluent's TableFlow</h3><p><strong>Confluent</strong> has announced the general availability of <strong>TableFlow</strong>, a technology that embraces stream-table duality. TableFlow simplifies the ingestion of event data from Kafka topics into structured lakehouse tables (currently supporting Iceberg and Delta Lake), abstracting away complex data engineering tasks. It manages the entire ingestion lifecycle, including schema evolution and table maintenance on Confluent Platform. 
<strong><a href="https://www.confluent.io/blog/latest-tableflow/">--&gt; Read More</a></strong></p><h3>&#128073; Materialize Self-Managed &amp; Free Community Edition</h3><p><strong>Materialize</strong>, the streaming database provider, has introduced two new offerings: a Self-Managed version and a Free Community Edition. The Self-Managed option allows deployment of Materialize in your own environment, providing greater control over performance and compliance. The Community Edition grants free access to Materialize's powerful features with specific usage limits, making it easier to test or run small-scale production workloads. <strong><a href="https://materialize.com/blog/materialize-for-everyone/">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1> &#128270; Case Studies</h1><div><hr></div><h3>&#128073; Streaming Data Ingestion Into Cloud Data Warehouse  </h3><p>This insightful case study by <strong>Canva</strong> compares different approaches to ingesting data into cloud data warehouses. It contrasts their previous micro-batch, file-based ingestion method (using AWS services like Firehose) with a new architecture using Snowflake's managed <strong>Snowpipe</strong> ingestion service. The new solution achieves low-latency ingestion of billions of events daily, significantly reducing query latency to under 10 minutes and lowering overall cloud costs. <strong><a href="https://www.canva.dev/blog/engineering/snowpipe-streaming/">--&gt; Read More</a></strong></p><h3>&#128073; Using Key-Value Stores for Exactly-Once Streaming Ingestion  </h3><p>Event deduplication and exactly-once processing guarantees are crucial challenges in building reliable streaming pipelines, often complicated by pipeline failures and network issues. One popular solution involves using an external, fast key-value store to track and eliminate duplicate events during streaming. 
<strong>MyHeritage</strong> shares their experience implementing exactly-once processing using Spark Structured Streaming and a key-value store for deduplication. <strong><a href="https://medium.com/myheritage-engineering/exactly-once-processing-in-spark-structured-streaming-39eb5ffcaa27">--&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#127909; Conferences &amp; Events</h1><div><hr></div><h3>&#128073; Google Cloud Next '25 Keynotes  </h3><p><strong>Google Cloud Next '25</strong> took place earlier this month, and both the opening keynote and developer keynote are now available on YouTube. The conference again focused heavily on AI, highlighting Google's Gemini LLM foundation model, AI agents, the Agent Development Kit, and new features in Vertex AI. On the analytics front, the new capabilities of Google's Data Science Agent look particularly exciting. Be sure to check out these keynotes to stay updated on the latest trends and technologies from Google Cloud. </p><p><strong>Opening Keynote:</strong></p><div id="youtube2-Md4Fs-Zc3tg" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;Md4Fs-Zc3tg&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/Md4Fs-Zc3tg?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p><strong>Developer Keynote:</strong></p><div id="youtube2-xLDSuXD8Mls" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;xLDSuXD8Mls&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/xLDSuXD8Mls?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; 
fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DS7n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png" width="220" height="218.42293906810036" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:558,&quot;resizeWidth&quot;:220,&quot;bytes&quot;:136646,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DS7n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 424w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 848w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1272w, https://substackcdn.com/image/fetch/$s_!DS7n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba33f19-c1ae-44c2-8991-607c54b5229a_558x554.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/in/alirezasadeghi/&quot;,&quot;text&quot;:&quot;Follow Me on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/in/alirezasadeghi/"><span>Follow Me on 
LinkedIn</span></a></p>]]></content:encoded></item><item><title><![CDATA[DLD #4 | Data Landscape Digest 🗞️]]></title><description><![CDATA[Rise of single-node engines, Postgres + DuckDB, Airflow alerting techniques, BigQuery continuous queries, top data engineering books and more]]></description><link>https://www.pracdata.io/p/dld-4-data-landscape-digest</link><guid isPermaLink="false">https://www.pracdata.io/p/dld-4-data-landscape-digest</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 03 Nov 2024 07:55:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elGK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!elGK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!elGK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!elGK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!elGK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!elGK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!elGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg" width="1456" height="1020" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!elGK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!elGK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!elGK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!elGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ff10d0-9864-4056-b75d-1bdb32c41112_1536x1076.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><div><hr></div><h1>&#10024; Featured - The Rise of Single-Node Engines</h1><div><hr></div><p>In a recent <strong><a href="https://motherduck.com/blog/redshift-files-hunt-for-big-data/">blog post</a></strong>, <strong>Jordan Tigani</strong>, MotherDuck co-founder and  former tech lead at Google BigQuery, highlighted that most companies don't actually 
deal with "<strong>Big Data</strong>." An analysis of half a billion sample queries run on Amazon Redshift revealed that over 80% of the queries processed less than 1 TB of data.</p><p>Even among the small percentage of businesses that do handle Big Data, the majority of queries (95%) are executed on smaller tables. In cases where companies have large datasets, with some tables exceeding 10 TB, about 96% of the queries still target smaller, likely aggregated tables of 100 GB or less, rather than the actual large tables.</p><p>A <strong><a href="https://www.amazon.science/publications/why-tpc-is-not-enough-an-analysis-of-the-amazon-redshift-fleet">published paper</a></strong> by AWS also notes that most tables contain fewer than a million rows, with the vast majority (98%) having fewer than a billion rows.</p><p>These findings, combined with advancements in software and hardware technology that have significantly enhanced the processing capabilities of single-node systems, suggest that powerful emerging single-node compute engines like <strong>DuckDB</strong> could increasingly handle many non-Big Data use cases. This would reduce the need for distributed processing frameworks like <strong>Spark</strong>, provided that these engines and their ecosystem support continue to mature.</p><p>In a recent <strong><a href="https://www.linkedin.com/posts/alirezasadeghi_why-single-node-engines-are-gaining-ground-activity-7252649616989450243-sbml?utm_source=share">LinkedIn post</a></strong>, I shared a graph illustrating some of these points, which sparked an engaging discussion in the data engineering community. 
The conversation featured a mix of opinions on the future of computing, and it's worth checking out.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mxt_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mxt_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 424w, https://substackcdn.com/image/fetch/$s_!mxt_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 848w, https://substackcdn.com/image/fetch/$s_!mxt_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 1272w, https://substackcdn.com/image/fetch/$s_!mxt_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mxt_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png" width="1201" height="812" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1201,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:325504,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mxt_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 424w, https://substackcdn.com/image/fetch/$s_!mxt_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 848w, https://substackcdn.com/image/fetch/$s_!mxt_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 1272w, https://substackcdn.com/image/fetch/$s_!mxt_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd64a88f3-93f6-4f59-ba66-cdbe043acec7_1201x812.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p></p><div><hr></div><h1>&#128161;Opinion</h1><div><hr></div><h2>&#128073; The Infamous Rise of Notebook Engineers!</h2><p><strong>Daniel Beach</strong> has penned an intriguing article critiquing the growing use of notebooks in data engineering. He argues that many engineers misuse notebooks due to either a lack of technical skills or encouragement from vendors, like Databricks, promoting questionable practices. While notebooks are valuable for data analysts and scientists engaged in iterative data analysis, Daniel contends they are ill-suited for robust data engineering lifecycles. This misuse leads to poor coding standards, insufficient testing, and inadequate deployment practices. 
<strong><a href="https://dataengineeringcentral.substack.com/p/the-rise-of-the-notebook-engineer">&#8212;&gt; Read More</a></strong></p><h2>&#128073; The Analytics Personas in Business</h2><p><strong>Tristan Handy</strong>, the founder and CEO of <strong>dbt Labs</strong>, wrote an insightful piece about the key analytics personas in business. He critiques the common approach of treating the analytical process like an assembly line, which often fails to deliver significant insights and ROI. Handy suggests that the personas&#8212;primarily engineers, analysts, and decision-makers&#8212;should be seen as interchangeable "hats" that team members can wear when needed, while still maintaining their primary roles. This flexible approach fosters collaboration and enhances the overall effectiveness of the analytics process. <strong><a href="https://roundup.getdbt.com/p/analytics-personas">&#8212;&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#128225; Open Source News</h1><div><hr></div><h3>&#128073; Apache Airflow 2.10 Release</h3><p><strong>Apache Airflow 2.10</strong> was released recently, introducing exciting new features such as support for multiple executors within a single Airflow environment. This allows users to assign different executors, like <code>LocalExecutor</code> and <code>CeleryExecutor</code>, to individual DAGs and even specific tasks. There are also numerous enhancements to Datasets and the UI, which are worth exploring. <strong><a href="https://airflow.apache.org/blog/airflow-2.10.0/">&#8212;&gt; Read More</a></strong></p><h3>&#128073;  Development of a New PostgreSQL DuckDB Extension</h3><p><strong>MotherDuck</strong> announced <strong><a href="https://github.com/duckdb/pg_duckdb">pg_duckdb</a></strong>, an open-source Postgres extension that embeds the <strong>DuckDB</strong> engine into the Postgres database for running analytical queries on Postgres data. 
This is a significant step towards easily transforming a popular OLTP system into an HTAP system using an embedded OLAP engine. The development looks promising, with multiple companies such as Microsoft, Neon, and Hydra joining the effort. The beta version has been released recently.  <strong><a href="https://motherduck.com/blog/pgduckdb-beta-release-duckdb-postgres/">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Ibis's Default Backend Change</h3><p>The <strong>Ibis</strong> DataFrame library project announced that it will drop the <strong>Pandas</strong> and <strong>Dask</strong> backends in favour of making <strong>DuckDB</strong> its default backend. This decision is due to DuckDB's ease of installation, impressive speed, and strong support within the Python ecosystem. <strong><a href="https://ibis-project.org/posts/farewell-pandas/">&#8212;&gt; Read More</a></strong></p><p></p><div><hr></div><h1> &#128736; Practical Data Engineering</h1><div><hr></div><h3>&#128073;   Apache Airflow Alerting Techniques</h3><p>The Google Data Analytics blog has provided a comprehensive overview of the alerting hierarchy in <strong>Apache Airflow</strong>, ranging from the top DAG level down to the individual task instance level. It details various alerting mechanisms that can be used to monitor the state of DAG runs and receive notifications about potential failures. The alerting techniques discussed are applicable not only to Google's Cloud Composer managed Airflow service but also to other Airflow deployments. <strong><a href="https://cloud.google.com/blog/products/data-analytics/apache-airflow-hierarchy-and-alerting-options-with-cloud-composer/">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Best Practices for Optimising Airflow</h3><p>The AWS blog has covered comprehensive strategies for optimising cost and performance in its Apache Airflow managed service, <strong>Amazon MWAA</strong>. 
Right-sizing remains crucial for achieving a balanced price-performance ratio in managed services. Amazon MWAA also supports auto-scaling, which can aid in this optimisation. The blog offers additional useful techniques for optimising DAG code to ensure that DAGs remain healthy, efficient, and scalable. These techniques can be applied to any Airflow deployment setup. <strong><a href="https://aws.amazon.com/blogs/big-data/optimize-cost-and-performance-for-amazon-mwaa/">&#8212;&gt; Read More</a></strong></p><p>Speaking of Airflow&#8230;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZzhX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZzhX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 424w, https://substackcdn.com/image/fetch/$s_!ZzhX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 848w, https://substackcdn.com/image/fetch/$s_!ZzhX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 1272w, https://substackcdn.com/image/fetch/$s_!ZzhX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ZzhX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png" width="500" height="541" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:541,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18654,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZzhX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 424w, https://substackcdn.com/image/fetch/$s_!ZzhX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 848w, https://substackcdn.com/image/fetch/$s_!ZzhX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 1272w, https://substackcdn.com/image/fetch/$s_!ZzhX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff376c0fb-3e3b-4c5e-8384-0a53cbb32261_500x541.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p></p><div><hr></div><h1>&#9881;&#65039; Technical Deep Dive</h1><div><hr></div><h3>&#128073; History and Evolution of Block Storage Services at AWS</h3><p>Another fascinating story on the <em><strong>All Things Distributed</strong></em> blog explores the evolution of block storage services offered by Amazon Web Services (AWS). Written by one of AWS's leading engineers, it highlights key milestones in the development of <strong>Elastic Block Store (EBS)</strong>, showcasing enhancements in performance, scalability, and continuous innovation. 
<strong><a href="https://www.allthingsdistributed.com/2024/08/continuous-reinvention-a-brief-history-of-block-storage-at-aws.html">&#8212;&gt; Read More</a></strong></p><h3>&#128073;  The Internals of Apache Parquet </h3><p><strong>Vu</strong> has authored an insightful article on the internals of <strong>Parquet</strong>, the most popular columnar file format for data lakes. For those working with data lakes, understanding the design and architecture of common serialisation formats is invaluable for optimising storage and queries. <strong><a href="https://blog.det.life/i-spent-8-hours-learning-parquet-heres-what-i-discovered-97add13fb28f">&#8212;&gt; Read More</a></strong></p><h3>&#128073;  The Future of Distributed Systems and Their Storage Backend</h3><p><strong>Colin Breck</strong> has written a great article on the future of distributed systems, highlighting current challenges and major trends such as the accelerating adoption of object storage as the main storage backend for many analytical and transactional database systems. <strong><a href="https://blog.colinbreck.com/predicting-the-future-of-distributed-systems/">&#8212;&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#128170; Skill Up</h1><div><hr></div><p>In a recent <strong><a href="https://www.linkedin.com/posts/alirezasadeghi_weekend-caffeinated-insights-book-recommendations-activity-7255889495051419648-Oxvm?utm_source=share">LinkedIn post</a></strong>, I shared my top book recommendations for learning data engineering fundamentals. The feedback was overwhelmingly positive, with many comments emphasising the importance of selecting a good book and focusing on mastering the fundamentals. 
I have personally read all these books in recent years and have gained a lot from them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T49x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T49x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 424w, https://substackcdn.com/image/fetch/$s_!T49x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 848w, https://substackcdn.com/image/fetch/$s_!T49x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 1272w, https://substackcdn.com/image/fetch/$s_!T49x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T49x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png" width="1036" height="1578" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/834a0814-413e-412a-a13c-e0d272134026_1036x1578.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1578,&quot;width&quot;:1036,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1121601,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!T49x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 424w, https://substackcdn.com/image/fetch/$s_!T49x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 848w, https://substackcdn.com/image/fetch/$s_!T49x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 1272w, https://substackcdn.com/image/fetch/$s_!T49x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F834a0814-413e-412a-a13c-e0d272134026_1036x1578.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h3>&#128073;  DataCamp&#8217;s Free Week</h3><p><strong>DataCamp</strong> is offering free access to its entire platform and all courses for a week, from November 4 to 10. This is a great opportunity to explore their courses and enhance your skills in the coming week! <strong><a href="https://www.datacamp.com/blog/datacamp-free-access-week">&#8212;&gt; Read More</a></strong></p><p></p><div><hr></div><h1> &#128270; Case Studies</h1><div><hr></div><h3>&#128073; Cost-effective Data Analytics Using Deterministic Sampling</h3><p><strong>Meta</strong> has shared valuable insights into its approach for achieving cost-effective data analytics through the use of deterministic sampling. This strategy is designed to balance the cost versus value trade-off, especially as data volumes and computation costs continue to rise exponentially. By employing deterministic sampling, Meta aims to reduce the overall cost and complexity of analytics without compromising the quality of insights. 
<strong><a href="https://medium.com/@AnalyticsAtMeta/scaling-analytics-instagram-the-power-of-deterministic-sampling-8ee7332d77ae">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Uber's New Declarative Batch ETL Framework</h3><p><strong>Uber</strong> has developed a modular declarative batch ETL framework called <em><strong>Sparkle</strong></em>, which leverages Apache Spark as the compute engine. Sparkle simplifies and standardises ETL pipeline development by allowing users to focus on expressing business logic as a sequence of transformation modules in SQL or Java/Scala/Python. It includes embedded unit testing and marks Uber's transition of all its batch ETL pipelines from Hive to Spark in 2023. <strong><a href="https://www.uber.com/blog/sparkle-modular-etl/">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Self-service Kafka Platform Development Journey</h3><p><strong>DoorDash</strong> has shared its journey in developing a self-service Kafka platform, aimed at addressing the challenges of managing Kafka infrastructure efficiently. This initiative was driven by the need to simplify the management of Kafka topics and resources, which was previously hindered by the use of low-level infrastructure-as-code tools like Terraform. <strong><a href="https://careers.doordash.com/blog/doordash-engineers-with-kafka-self-serve/">&#8212;&gt; Read More</a></strong></p><p></p><div><hr></div><h1> &#128227; Vendors News &amp; Announcements</h1><div><hr></div><h3>&#128073; Continuous Queries on Data Warehouse Systems</h3><p><strong>Google BigQuery</strong> has introduced a significant new feature called <em><strong>BigQuery continuous queries</strong></em>, currently available in Preview. This feature lets BigQuery act as an event-driven streaming pipeline rather than a purely batch system, leveraging the concept of <strong>Stream-Table Duality</strong>. 
It allows for the continuous ingestion of new events as data is loaded into BigQuery, enabling use cases like event-driven data processing, continuous record replication to a pub/sub queue or other streaming storage systems, real-time ML model integration, and Reverse ETL use cases. <strong><a href="https://cloud.google.com/blog/products/data-analytics/bigquery-continuous-queries-makes-data-analysis-real-time/">&#8212;&gt; Read More</a></strong></p><p>The Confluent blog has also <a href="https://www.confluent.io/blog/streaming-bigquery-data-into-confluent-with-continuous-queries/">published</a> an article on leveraging this feature to stream data from BigQuery into the Confluent platform.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S1ZP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S1ZP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 424w, https://substackcdn.com/image/fetch/$s_!S1ZP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 848w, https://substackcdn.com/image/fetch/$s_!S1ZP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 1272w, 
https://substackcdn.com/image/fetch/$s_!S1ZP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S1ZP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png" width="1456" height="566" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;BQ-image1&quot;,&quot;title&quot;:&quot;BQ-image1&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="BQ-image1" title="BQ-image1" srcset="https://substackcdn.com/image/fetch/$s_!S1ZP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 424w, https://substackcdn.com/image/fetch/$s_!S1ZP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 848w, https://substackcdn.com/image/fetch/$s_!S1ZP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 1272w, 
https://substackcdn.com/image/fetch/$s_!S1ZP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1427e572-9188-47bb-b0fc-044e3e280b01_1999x777.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>&#128073; New Google Managed Apache Kafka Service</h3><p>Google has also announced a new managed service, Google Cloud Managed Service for <strong>Apache Kafka</strong>. 
This service abstracts away the complexity of deploying and managing a Kafka cluster, offering features like security management, fully managed brokers and storage, and automatic horizontal and vertical scaling. It also includes automatic storage tiering and lifecycle management to offload cold data to cloud storage. <strong><a href="https://cloud.google.com/blog/products/data-analytics/new-managed-service-for-apache-kafka/">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Introduction of Conditional Writes on AWS S3</h3><p>AWS recently announced "<em><strong>Conditional Writes</strong></em>" on <strong>S3</strong>, a feature that improves the reliability and efficiency of data operations, especially for distributed applications. A conditional write succeeds only if the stated condition is met, which reduces the risk of unintentional overwrites. Multiple clients can therefore write to the same object key without silently overwriting each other's data. <strong><a href="https://www.infoq.com/news/2024/08/amazon-s3-conditional-writes/">&#8212;&gt; Read More</a></strong></p><p></p><div><hr></div><h1>&#127909; Conferences &amp; Events</h1><div><hr></div><h3>&#128073; Thinking Like an Architect</h3><p><strong>Gregor Hohpe</strong> delivered a presentation titled "Thinking Like an Architect" at <strong>QCon London 2024</strong>. If you work with or are interested in data architecture, the talk is well worth watching. <strong><a href="https://www.infoq.com/presentations/architect-lessons/">&#8212;&gt; Watch</a></strong></p><h3>&#128073; Carnegie Mellon University's Intro to Database Systems Course - Fall 2024</h3><p>The Fall 2024 session of Carnegie Mellon University's renowned "<strong>Intro to Database Systems</strong>" course commenced in August. 
You can follow along with the course through recorded lectures available on their <strong><a href="https://www.youtube.com/playlist?list=PLSE8ODhjZXjYDBpQnSymaectKjxCy6BYq">YouTube channel</a>.</strong></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.pracdata.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Practical Data Engineering! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[DLD #3 | Data Landscape Digest 🗞️]]></title><description><![CDATA[Open Catalog War, Latest Apache Kafka and Apache Flink Releases, Airflow Trigger Rules, Lakehouse File Formats and More.]]></description><link>https://www.pracdata.io/p/dld-3-data-landscape-digest</link><guid isPermaLink="false">https://www.pracdata.io/p/dld-3-data-landscape-digest</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 22 Sep 2024 08:37:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uP6V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!uP6V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uP6V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uP6V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uP6V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uP6V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uP6V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg" width="1456" height="1020" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uP6V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uP6V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uP6V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uP6V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceac6ea7-2495-4d53-9a97-3a15cbb4ad9c_1536x1076.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h1> &#10024; Featured - The New Open Catalog War</h1><div><hr></div><p>In July this year, <strong>Snowflake</strong> <strong><a href="https://www.snowflake.com/en/blog/polaris-catalog-open-source/">open-sourced</a></strong> its <strong>Polaris Catalog</strong> under the Apache 2.0 license, with plans to submit it to the Apache Incubator program. Polaris is a catalog service designed for Apache Iceberg but can extend to other major open formats as well.</p><p><strong>The question is:</strong> why does <strong>Apache Iceberg,</strong> which already has its own metadata layer, need a separate catalog service? </p><p>While <strong>Delta Lake</strong>, <strong>Apache Hudi</strong>, and <strong>Apache Iceberg</strong> open table formats  provide their own metadata, each query engine (such as Spark, Flink, Presto or Trino) must perform separate integrations for tasks like schema discovery and data operations. 
</p><p>An <em><strong>open</strong></em> <em><strong>unified catalog service</strong></em> like Polaris simplifies this by streamlining multi-engine <em><strong>interoperability</strong></em>. It also offers enhanced features like improved search, data discovery, tagging, and governance, including access control through a <em><strong>unified</strong></em>, <em><strong>open</strong></em>, and <em><strong>vendor-agnostic</strong></em> interface compatible with various storage engines. </p><p>Polaris has the potential to become the standard catalog service for data lakehouse platforms, much like <strong>Hive Metastore</strong> was for Hadoop-based systems. Currently, catalog options include proprietary tools like AWS Glue Catalog and Databricks&#8217; Unity Catalog, which was also open-sourced in June 2024. </p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UyWz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UyWz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 424w, https://substackcdn.com/image/fetch/$s_!UyWz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 848w, https://substackcdn.com/image/fetch/$s_!UyWz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 1272w, 
https://substackcdn.com/image/fetch/$s_!UyWz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UyWz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png" width="1456" height="899" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:899,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:625076,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UyWz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 424w, https://substackcdn.com/image/fetch/$s_!UyWz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 848w, https://substackcdn.com/image/fetch/$s_!UyWz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 1272w, 
https://substackcdn.com/image/fetch/$s_!UyWz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee7cdda8-51d3-45e2-bbda-d58f5b7124e2_1920x1185.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>Snowflake&#8217;s decision to open-source Polaris may have been influenced by Databricks' move to open source <strong>Unity Catalog</strong>. 
While some may see this as a new &#8220;<strong>catalog war</strong>&#8221; driven by marketing strategies, as noted by <strong><a href="https://materializedview.io/p/data-lakehouse-catalog-reality-check">Chris</a></strong>, there&#8217;s still hope that these moves will lead to production-ready, open-source catalog services that finally provide an alternative to Hive Metastore after all these years.</p><p></p><div><hr></div><h1>&#128225; Open Source News</h1><div><hr></div><h3>&#128073;   Seamless Integration of dbt and Airflow</h3><p>Building and scheduling data pipelines with <strong>dbt</strong> models and a workflow orchestrator like <strong>Airflow</strong> has become standard practice in data engineering for running transformation workflows. In response to the growing demand for seamless integration between the two systems, <strong>Astronomer</strong> developed a Python package called <strong><a href="https://github.com/astronomer/astronomer-cosmos">Cosmos</a></strong>. This package simplifies running dbt models within Airflow by rendering a dbt project as an Airflow DAG through its <code>DbtDag</code> class. 
The latest version, 1.5.1, was released on July 17, so if you're using dbt and Airflow, it's definitely worth checking out. <strong><a href="https://www.astronomer.io/blog/airflow-dbt-next-chapter/">&#8212;&gt; Read More</a></strong></p><h3>&#128073; What's New in Apache Kafka 3.8.0</h3><p><strong>Apache Kafka 3.8.0</strong> was recently released, bringing several new features and improvements. The announcement on the official Apache Kafka site provides an overview of key updates, including support for compression levels, a new consumer rebalance protocol, and re-bootstrapping capabilities. <strong><a href="https://kafka.apache.org/blog#apache_kafka_380_release_announcement">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Apache Flink 1.20: New Features and the Road to Flink 2.0</h3><p><strong>Apache Flink 1.20</strong> was also recently released, and the Confluent blog highlights the major improvements and features in this update. Notable enhancements include improvements to the bucketing feature for <strong>Flink SQL</strong> tables, allowing users to specify the number of buckets in the <code>DISTRIBUTED BY</code> clause, and the introduction of Flink SQL <strong>Materialised tables</strong>, which are automatically refreshed in the background as data streams in. The release also brings various operational improvements, and it is expected to be the last minor release before Flink 2.0. <strong><a href="https://www.confluent.io/blog/exploring-apache-flink-1-20-features-improvements-and-more/">&#8212;&gt; Read More</a></strong></p><div><hr></div><h1>&#128736; Practical Data Engineering</h1><div><hr></div><h3>&#128073;  Mastering Airflow Trigger Rules</h3><p><strong>Astronomer</strong>, a managed Airflow service provider, has published an overview of <strong>Airflow trigger rules</strong> with a visual guide to help new Airflow developers understand and apply the right trigger rules in their DAGs. 
For new engineers, grasping all the trigger rules can be challenging, but it is crucial for managing dependencies between upstream and downstream tasks. <strong><a href="https://www.astronomer.io/blog/understanding-airflow-trigger-rules-comprehensive-visual-guide/">&#8212;&gt; Read More</a></strong></p><h3>&#128073;  A Radical Simplicity Approach to Data Engineering</h3><p>The <strong>Towards Data Science</strong> blog recently shared a great post about the trade-off between simplicity and functionality in software projects, including data engineering. The author advocates for a philosophy of "<em><strong>Radical Simplicity</strong></em>," where simple, straightforward solutions are prioritised over complex ones. This resonates with me, as I believe complexity should only be introduced when absolutely necessary.  <strong><a href="https://towardsdatascience.com/radical-simplicity-in-data-engineering-86ec3d2bd71c">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Hands-On Guide: Installing and Integrating Polaris OSS</h3><p>If you're interested in installing and testing the latest <strong>Apache Polaris</strong> open-source release, <strong>Dremio</strong> has published a hands-on tutorial that guides you through the installation process and integration with Apache Spark and Apache Iceberg. <strong><a href="https://www.dremio.com/blog/getting-hands-on-with-polaris-oss-apache-iceberg-and-apache-spark/">&#8212;&gt; Read More</a></strong></p><div><hr></div><h1>&#9881;&#65039; Technical Deep Dive</h1><div><hr></div><h3>&#128073;  Parquet vs ORC: Choosing the Right Format for Data Lakehouse</h3><p>In a recent Apache Hudi blog post, the author compares <strong>Parquet</strong> and <strong>ORC</strong>, two of the most popular columnar file formats for data lakes and open table formats. 
The post argues that Parquet delivers better performance for read-heavy, complex analytical use cases where query performance is crucial, while ORC offers a more balanced approach for both read and write performance with superior compression, making it a better fit for general-purpose data storage in Hudi. <strong><a href="https://hudi.apache.org/blog/2024/07/31/hudi-file-formats/">&#8212;&gt; Read More</a></strong></p><h3>&#128073;  Overview of Amazon MSK Tiered Storage</h3><p><strong>Amazon</strong> recently published a post explaining how the new Kafka tiered storage in <strong>Amazon MSK</strong> (Managed Streaming for Apache Kafka) enhances scalability and resiliency. With the new decoupled storage and compute architecture, the system benefits from faster broker recovery, improved load balancing, and virtually unlimited scalability.  <strong><a href="https://aws.amazon.com/blogs/big-data/improve-apache-kafka-scalability-and-resiliency-using-amazon-msk-tiered-storage/">&#8212;&gt; Read More</a></strong></p><div><hr></div><h1> &#128172; Community Discussions</h1><div><hr></div><p>There were several widely discussed threads on Reddit last month (such as <strong><a href="https://www.reddit.com/r/dataengineering/comments/1efb5xr/this_market_is_seriously_wack/">this one</a></strong> and <strong><a href="https://www.reddit.com/r/dataengineering/comments/1edezgh/a_data_engineer_doing_power_bi_stuff/">this one</a></strong>) about data engineering roles and the challenges of applying for data engineering jobs. Based on comments from both candidates and hiring managers (including CTOs), the problem is two-fold. </p><p>On one hand, the market is flooded with unqualified candidates, often with minimal skills from low-quality bootcamps and online courses, making it hard for companies to find qualified engineers. On the other hand, there are vague and unclear job descriptions, leaving data engineers unsure of what is expected of them once hired. 
This has created frustration on both sides.</p><div><hr></div><h1> &#128270; Case Studies</h1><div><hr></div><h3>&#128073;   Evolution of Apache Flink Architecture at Airbnb</h3><p><strong>Airbnb</strong> published a post detailing the evolution of their <strong>Apache Flink architecture</strong>. Initially, in 2018, they deployed Flink jobs on Hadoop YARN with Airflow as the scheduler. Today, they&#8217;ve moved to deploying Flink jobs on <strong>Kubernetes</strong>, eliminating the need for a job scheduler.  <strong><a href="https://medium.com/airbnb-engineering/apache-flink-on-kubernetes-84425d66ee11">&#8212;&gt; Read More</a></strong></p><h3>&#128073; Pinterest's Adoption of StarRocks for Real-Time Analytics</h3><p><strong>Pinterest</strong> shared their recent adoption of <strong>StarRocks</strong>, a real-time OLAP engine, for their real-time analytics platform. They chose StarRocks for its features like support for standard SQL, joins, sub-queries, and materialized views, capabilities not readily available in other real-time OLAP engines like Druid. Back in 2021, Pinterest <strong><a href="https://medium.com/pinterest-engineering/pinterests-analytics-as-a-platform-on-druid-part-1-of-3-9043776b7b76">published</a></strong> details about managing a large Druid fleet with 2,000 nodes in a multi-cluster setup. <strong><a href="https://medium.com/pinterest-engineering/delivering-faster-analytics-at-pinterest-a639cdfad374">&#8212;&gt; Read More</a></strong></p><h3>&#128073; The Rise of New Real-time OLAP Engines</h3><p>On that note, while Apache Druid, Pinot, and ClickHouse have dominated the open-source real-time OLAP space in recent years, we&#8217;re now seeing increased adoption of newer engines like <strong>Apache Doris</strong> and <strong>StarRocks</strong>, the latter being a fork of Doris. 
For a detailed comparison between StarRocks and Doris, check out this <strong><a href="https://medium.com/starrocks-engineering/detailed-comparison-between-starrocks-and-apache-doris-81ddd34be527">blog post</a></strong> from StarRocks Engineering.</p><h3>&#128073; Uber&#8217;s Hadoop-to-Cloud Migration</h3><p><strong>Uber</strong>, which operates one of the largest on-premises <strong>Hadoop</strong> clusters, has recently begun migrating to the cloud, starting with a key architectural shift: replacing the HDFS file system with Google Cloud Storage, while still running the rest of their stack on IaaS. One of the challenges in migrating from Hadoop to the cloud is transitioning Hadoop&#8217;s security features, such as delegation tokens and Kerberos authentication, to Google Cloud&#8217;s token-based security. Uber discusses how they tackled these security migration challenges in this article. <strong><a href="https://www.uber.com/en-AU/blog/securing-hadoop-on-gcp/">&#8212;&gt; Read More</a></strong></p><div><hr></div><h1> &#128227; Vendors News &amp; Announcements</h1><div><hr></div><h3>&#128073; Fivetran Integration with Snowflake&#8217;s Polaris Catalog</h3><p>Just days after Snowflake open-sourced the Polaris catalog service on GitHub, <strong>Fivetran</strong>, a leading SaaS provider for data integration, announced an upcoming integration with the <strong>Polaris</strong> catalog. This integration aims to develop a managed catalog solution for Fivetran&#8217;s <strong>Managed Data Lake Service</strong>. 
<strong><a href="https://www.fivetran.com/blog/unlock-catalog-interoperability-with-fivetran-and-polaris">&#8212;&gt; Read More</a></strong></p><h3>&#128073;  Databricks LakeFlow Connect for Automated Data Ingestion</h3><p><strong>Databricks</strong> announced the public preview of <strong><a href="https://www.databricks.com/product/data-ingestion">LakeFlow Connect</a></strong>, an automated incremental data ingestion service for sources like SQL Server and Salesforce. Built on <strong>Delta Live Tables</strong>, LakeFlow Connect enables incremental data ingestion using CDC (Change Data Capture). This marks another step by major vendors toward automating data engineering tasks. <strong><a href="https://www.databricks.com/blog/ingest-data-sql-server-salesforce-and-workday-lakeflow-connect">&#8212;&gt; Read More</a></strong></p><h3>&#128073;  Databricks Lakehouse Federation Across AWS, Azure, and GCP</h3><p><strong>Databricks</strong> also announced the general availability of <strong>Lakehouse Federation</strong> in Unity Catalog across AWS, Azure, and GCP last month. This mirrors the strategy of other top cloud vendors to offer a unified analytical platform with centralised data discovery and governance, providing a unified view of enterprise data across multiple storage engines and cloud platforms. 
<strong><a href="https://www.databricks.com/blog/announcing-general-availability-lakehouse-federation">&#8212;&gt; Read More</a></strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TYor!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TYor!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 424w, https://substackcdn.com/image/fetch/$s_!TYor!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 848w, https://substackcdn.com/image/fetch/$s_!TYor!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!TYor!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TYor!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png" width="1456" height="1549" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1549,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209085,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TYor!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 424w, https://substackcdn.com/image/fetch/$s_!TYor!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 848w, https://substackcdn.com/image/fetch/$s_!TYor!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!TYor!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498481cf-3888-4a4d-b485-4703a1727c7d_1504x1600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h3>&#128073;  ClickHouse Acquisition of PeerDB for Real-Time Postgres Ingestion</h3><p><strong>ClickHouse, Inc.</strong> announced the acquisition of <strong><a href="https://peerdb.io/">PeerDB</a></strong>, a provider of Change Data Capture (CDC) for <strong>Postgres</strong> databases. This move aims to integrate and streamline real-time data ingestion from transactional databases like Postgres into the ClickHouse OLAP engine. <strong><a href="https://clickhouse.com/blog/clickhouse-welcomes-peerdb-adding-the-fastest-postgres-cdc-to-the-fastest-olap-database">&#8212;&gt; Read More</a></strong></p><div><hr></div><h1> &#127909; Conferences &amp; Events</h1><div><hr></div><h3>&#128073; Kafka Current 2024 Keynotes</h3><p>The two-day <strong>Kafka Current 2024</strong> event, organised by Confluent, took place last week in Austin. 
Keynotes from both Day 1 and Day 2 have already been published on YouTube: <strong><a href="https://www.youtube.com/watch?v=Sn6fVsOzrSU">Keynote Day 1</a> | <a href="https://www.youtube.com/watch?v=ccupkhcLioM">Keynote Day 2</a></strong></p><p></p><h3>&#128073;  Open Source Data Summit Virtual Conference</h3><p>The <strong>Open Source Data Summit</strong> Virtual Conference will be held on October 2nd. If you're interested, you can register for free at <strong><a href="https://opensourcedatasummit.com/">opensourcedatasummit.com</a></strong>. The event will feature numerous discussions on data lakehouses and the role of open table formats in modern data architectures.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.pracdata.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Practical Data Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[DLD #2 | Data Landscape Digest 🗞️]]></title><description><![CDATA[Curated Knowledge on Data Engineering Landscape]]></description><link>https://www.pracdata.io/p/dld-2-data-landscape-digest</link><guid isPermaLink="false">https://www.pracdata.io/p/dld-2-data-landscape-digest</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sat, 31 Aug 2024 16:16:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KLRn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KLRn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KLRn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 
848w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KLRn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg" width="1456" height="1020" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KLRn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!KLRn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><div><hr></div><h2>&#10024; Featured - Netflix Maestro Workflow Engine</h2><div><hr></div><p>The crowded field of 
open-source engines has just welcomed a new player: <strong>Maestro</strong>, recently open-sourced by <strong>Netflix</strong>!</p><p>Netflix asserts that Maestro is a highly scalable and flexible scheduler capable of managing large-scale heterogeneous workflows, including ML training and data pipelines. It supports flexible execution logic, such as Docker images and notebooks, and accommodates various workflow patterns, both cyclic and acyclic (DAGs). </p><p>One of its standout features is the <em><strong>foreach</strong></em><strong> pattern</strong>, which is particularly useful for repetitive tasks like ML model training and data backfilling, something that would typically require separate job runs on a scheduler like <strong>Airflow</strong> to backfill daily ingested source data. Maestro also offers multiple <strong>domain-specific languages (DSLs)</strong> for defining workflows declaratively using YAML files, a feature that would need to be custom-built on top of Airflow.</p><p>Since its open-source release in July, the project has already garnered 3,000 stars on <a href="https://github.com/Netflix/maestro">GitHub</a>. 
Netflix had <strong><a href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c">previously covered</a></strong> the internals and use cases implemented with this engine, and their <strong><a href="https://netflixtechblog.com/maestro-netflixs-workflow-orchestrator-ee13a06f9c78">latest blog post</a></strong> provides a comprehensive overview of Maestro's features and supported workflow patterns.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8N2F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8N2F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 424w, https://substackcdn.com/image/fetch/$s_!8N2F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 848w, https://substackcdn.com/image/fetch/$s_!8N2F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 1272w, https://substackcdn.com/image/fetch/$s_!8N2F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8N2F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png" width="510" height="367.78846153846155" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1050,&quot;width&quot;:1456,&quot;resizeWidth&quot;:510,&quot;bytes&quot;:351926,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8N2F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 424w, https://substackcdn.com/image/fetch/$s_!8N2F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 848w, https://substackcdn.com/image/fetch/$s_!8N2F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 1272w, https://substackcdn.com/image/fetch/$s_!8N2F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F963fe2fd-7a94-4f8e-8cd2-633ea7c8645d_3152x2274.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>Additionally, this <strong><a href="https://blog.det.life/netflix-maestro-and-apache-airflow-competitors-or-companions-in-workflow-orchestration-2bce948956a5">blog post</a></strong> provides a thorough comparison between <strong>Airflow</strong> and <strong>Maestro</strong>, complete with practical examples and code snippets. </p><p></p><div><hr></div><h2>&#128161; Trends &amp; Insight</h2><div><hr></div><h4>&#128073; The State of Modern Data Stack in 2024</h4><p>In 2024, there has been much discussion about the decline of the <strong>Modern Data Stack (MDS)</strong>. Concerns have been raised about the economics of the Modern Data Stack, and the term itself is being recycled, much like previously hyped concepts such as "Big Data". 
Some experts believe that many MDS startups are <strong><a href="https://joereis.substack.com/p/everything-ends-my-journey-with-the">doomed to extinction</a></strong>. As <strong><a href="https://mattturck.com/mad2024/">Matt Turck</a></strong> pointed out, the Modern Data Stack was largely a <em>marketing concept and an alliance among several startups across the data value chain</em>. In a recent blog post, <strong>Ananth</strong> explores the history and decline of the MDS and the emerging <strong>post-MDS</strong> era. <strong><a href="https://www.dataengineeringweekly.com/p/a-brief-history-of-modern-data-stack">&#8212;&gt; Read more</a></strong></p><h4>&#128073; Embracing '<em>Bring Your Own Compute</em>' with DuckDB</h4><p> An interesting discussion took place between <strong>MotherDuck</strong>&#8217;s co-founder and CEO and <strong>Fivetran</strong>&#8217;s co-founder and CEO about the future of big data and single-node or laptop-sized analytics using <strong>DuckDB</strong>. With advancements in hardware, we might witness a new shift towards <em>local execution</em>, where computing is fully or partially pushed to the user&#8217;s machine, introducing the concept of <em><strong>Bring Your Own Compute</strong></em> for analytics. <strong><a href="https://www.fivetran.com/blog/the-future-of-big-data-processinghttps://www.fivetran.com/blog/the-future-of-big-data-processing-may-be-laptop-sized-may-be-laptop-sized">&#8212;&gt; Watch the interview</a></strong></p><div><hr></div><h2>&#128225; Open Source News</h2><div><hr></div><h4>&#128073;  Apache Kafka 3.8 Release</h4><p><strong>Apache Kafka 3.8</strong> has been released. The Confluent blog provides a summary of the new features and improvements in this release. 
<strong><a href="https://www.confluent.io/blog/introducing-apache-kafka-3-8/">&#8212;&gt; Read more</a></strong></p><h4>&#128073;  Delta Lake 4.0 New Features</h4><p>The <strong>Delta Lake 4.0</strong> Preview was <strong><a href="https://delta.io/blog/delta-lake-4-0/">announced</a></strong> in June. A blog post highlights important new features of this release, such as Change Data Feed (CDF), Liquid Clustering, and new monitoring features, with some practical examples. <strong><a href="https://medium.com/@vishalbarvaliya/delta-lake-4-0-a-simple-guide-842c3afbcd06">&#8212;&gt; Read More</a></strong></p><h4>&#128073; DAG Factory Project Takeover by Astronomer</h4><p> <strong>Astronomer</strong> has announced that it is taking over the open-source project <strong><a href="https://github.com/astronomer/dag-factory">DAG Factory</a></strong>, a Python library for authoring Airflow DAGs declaratively using YAML configuration files. Providing a thin no-code abstraction layer on top of Airflow has become a common practice among tech companies to reduce the engineering effort required to create DAGs, standardise pipeline creation for common use cases like data transformation, and make the process more self-service. <strong><a href="https://www.astronomer.io/blog/astronomer-adopts-dag-factory-democratize-writing-data-pipelines/">&#8212;&gt; Read more</a></strong></p><div><hr></div><h2>&#128736; Practical Data Engineering</h2><div><hr></div><h4>&#128073; Consistent Data Modeling and Naming Conventions</h4><p> Implementing a consistent data modeling framework, such as standardised naming conventions, is crucial to maintaining a healthy data platform and ensuring long-term scalability, even when there is turnover among engineers. Mike discusses some of the key aspects and best practices for data warehouse modeling, including effective naming conventions for tables and schemas. 
<strong><a href="https://towardsdatascience.com/advanced-data-modelling-1e496578bc91">&#8212;&gt; Read more</a></strong></p><h4>&#128073; State of CI/CD for Data Pipelines</h4><p><strong>LakeFS</strong> published a comprehensive overview of implementing Continuous Integration/Continuous Delivery (CI/CD) for data pipelines, focusing on the <strong>Write-Audit-Publish (WAP)</strong> ingestion pattern. The article explores various options and tools available in the market, offering insights into how to effectively integrate CI/CD practices into data workflows. <strong><a href="https://lakefs.io/blog/cicd-pipeline-guide/">--&gt; Read more</a></strong></p><h4> &#128073; Data Reconciliation Techniques and Best Practices</h4><p><strong>Datafold</strong> has published a three-part series on <strong>data reconciliation</strong>, a crucial subset of data quality. The series covers use cases, techniques, challenges, and best practices for performing data reconciliation across data sources and targets, with the goal of ensuring data accuracy and completeness. <strong><a href="https://www.datafold.com/blog/what-is-data-reconciliation">&#8212;&gt; Read more</a></strong></p><h4>&#128073;  dbt Beyond the Marketing Hype</h4><p>There have been many blog posts and discussions about the hype <strong>dbt</strong> has generated over the past few years. The author of this <strong><a href="https://blog.det.life/no-data-engineers-dont-need-dbt-30573eafa15e">blog post</a></strong> takes a different approach, discussing the challenges of performing data transformation in data warehouses and how dbt can address them. It starts with real problems and then explores how tooling, specifically dbt, provides solutions&#8212;rather than starting with the tool (because it's popular and everyone is talking about it) and then searching for problems it can solve. 
The post also offers a clear definition of what dbt actually does:</p><blockquote><p> <strong>dbt works by abstracting common data-warehouse patterns into config-driven automation and providing a suite of tools to simplify SQL transformations, tests, and documentation.</strong></p></blockquote><p></p><div><hr></div><h2>&#9881;&#65039; Technical Deep Dive</h2><div><hr></div><h4>&#128073; Evolution of Debezium's Internal Engine</h4><p> Although <strong>Debezium</strong> plugins are primarily used within the Kafka Connect framework and runtime in many streaming CDC data architectures, it is also possible to run Debezium connectors outside the Kafka ecosystem. This can be done by embedding the Debezium engine in internal applications or by using the standalone <strong>Debezium Server</strong>, which is now a separate project on <strong><a href="https://github.com/debezium/debezium-server/">GitHub</a></strong>. This Debezium blog post discusses the evolution of Debezium's internal engine, from the initial <code>EmbeddedEngine</code> implementation, which was mainly built for testing, to the new <code>AsyncEmbeddedEngine</code>, which addresses the shortcomings of the previous implementation. <strong><a href="https://debezium.io/blog/2024/07/08/async-embedded-engine/">&#8212;&gt; Read more</a></strong></p><h4>&#128073; A Guide to Concurrency Levels in Apache Airflow</h4><p>One of the most confusing aspects of the <strong>Apache</strong> <strong>Airflow</strong> engine, especially for newcomers, is how concurrency is applied at different levels, such as the Airflow scheduler, DAG level, and task level, and how their combination can impact overall workflow performance. The configuration parameters in the config file of early releases added to this confusion, with names that often seemed unrelated. 
In a recent blog post, <strong>Google</strong> provides a comprehensive overview of the various concurrency levels in Airflow, with a particular focus on its managed Airflow service. <strong><a href="https://cloud.google.com/blog/products/data-analytics/airflow-dag-and-task-concurrency-in-cloud-compose">&#8212;&gt; Read more</a></strong></p><h4>&#128073;  Apache Airflow Software Architecture </h4><p>A helpful guide with visuals, posted on the Apache Airflow blog, that illustrates the key underlying components of the Apache Airflow software architecture and how they interact within the system. <strong><a href="https://medium.com/apache-airflow/airflow-architecture-simplified-3d582fc3ccb0">&#8212;&gt; Read more</a></strong></p><p>Speaking of Airflow, do you...!?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!COx4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!COx4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 424w, https://substackcdn.com/image/fetch/$s_!COx4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 848w, https://substackcdn.com/image/fetch/$s_!COx4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 1272w, 
https://substackcdn.com/image/fetch/$s_!COx4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!COx4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png" width="640" height="360" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:360,&quot;width&quot;:640,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24060,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!COx4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 424w, https://substackcdn.com/image/fetch/$s_!COx4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 848w, https://substackcdn.com/image/fetch/$s_!COx4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 1272w, 
https://substackcdn.com/image/fetch/$s_!COx4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b4978a-ccc8-46e4-a162-84b4d8b70a21_640x360.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h4>&#128073; Bringing GenAI and LLMs to Flink Streaming Pipelines</h4><p><strong>Confluent</strong> blog provided an overview of a new <strong>Flink AI </strong>feature, which allows streaming data pipelines to invoke AI models, including generative AI (GenAI) large language model (LLM) endpoints (such as OpenAI and Google Vertex AI), directly from Flink 
SQL statements. This enables tasks like AI model inference, regression, and classification to be seamlessly integrated into real-time data processing workflows. <strong><a href="https://www.confluent.io/blog/flinkai-realtime-ml-and-genai-confluent-cloud/">&#8212;&gt; Read more</a></strong></p><div><hr></div><h2>&#128270; Case Studies</h2><div><hr></div><h4>&#128073; Pinterest's Migration from HBase to TiDB</h4><p><strong>Pinterest</strong> shared their journey of replacing <strong>HBase</strong> storage with a modern, scalable open-source database system that meets their requirements for reliability, performance, tunable consistency, and robust CDC support. They ultimately chose <strong>TiDB</strong> as the solution. We are seeing more stories of companies exploring alternatives to HBase due to its limitations and maintenance overhead. <strong><a href="https://medium.com/pinterest-engineering/tidb-adoption-at-pinterest-1130ab787a10">&#8212;&gt; Read more</a></strong></p><h4>&#128073; Slack&#8217;s Migration to EMR 6</h4><p><strong>Slack</strong> discusses their migration from EMR 5 with Spark 2 to EMR 6 with Hive 3 and Spark 3 on AWS, highlighting the performance and reliability improvements achieved in their data pipelines, which are developed using Apache Spark and scheduled on Airflow. <strong><a href="https://slack.engineering/unlocking-efficiency-and-performance-navigating-the-spark-3-and-emr-6-upgrade-journey-at-slack/">&#8212;&gt; Read more</a></strong></p><p></p><div><hr></div><h2>&#128172; Community Discussions</h2><div><hr></div><p>There was a recent <strong><a href="https://www.reddit.com/r/dataengineering/comments/1eajbke/how_fast_data_engineering_is_moving_forward/">discussion on Reddit</a></strong> on <strong>how fast data engineering is progressing</strong>. The consensus among most commenters is that while tools, storage systems, and processing frameworks may evolve rapidly, the fundamentals of data engineering remain consistent. 
These fundamentals include data integration, data modeling, and the processes of extracting, transforming, and loading data (ETL).</p><p>For <strong>aspiring data engineers</strong>, the key takeaway is to invest time and effort in mastering these fundamentals rather than focusing solely on becoming an expert in specific tools. While vendors are striving to automate the data engineering lifecycle as much as possible (as seen with the latest Databricks offerings), a strong understanding of the fundamentals will always be valuable and help you stand out in the field.</p><p></p><div><hr></div><h2>&#128227; Vendors News &amp; Announcements</h2><div><hr></div><h4>&#128073; Snowflake's New Cortex Search Feature</h4><p><strong>Snowflake</strong> announced a new feature called <strong>Cortex Search</strong> (currently in Public Preview) in July 2024. This search service is designed for unstructured data, such as text, and enables enterprises to deploy <strong>Retrieval-Augmented Generation (RAG)</strong> applications using Snowflake, allowing them to customise generative AI applications with proprietary data. <strong><a href="https://www.snowflake.com/en/blog/cortex-search-ai-hybrid-search/">&#8212;&gt; Read more</a></strong></p><h4>&#128073;  Databricks Mosaic AI Model Training</h4><p>Around the same time, <strong>Databricks</strong> announced support for <strong>Mosaic AI Model Training</strong>, which streamlines the fine-tuning of general-purpose open-source LLM and GenAI models, such as Llama 3 and Mistral, using enterprise data. Databricks recommends a new approach for training LLMs with enterprise data called <a href="https://arxiv.org/abs/2403.10131">Retrieval Augmented Fine-tuning (RAFT)</a>, which combines both Retrieval-Augmented Generation (RAG) and model fine-tuning. 
<strong><a href="https://www.databricks.com/blog/introducing-mosaic-ai-model-training-fine-tuning-genai-models">&#8212;&gt; Read more</a></strong></p><h4>&#128073; Release of Confluent Platform 7.7</h4><p><strong>Confluent</strong> announced the release of Confluent Platform 7.7, built on Apache Kafka 3.7. This update introduces significant features, including <strong>Confluent Platform for Apache Flink</strong>, a fully managed and serverless stream processing service (currently in Limited Availability), as well as a self-managed HTTP Source connector for ingesting data from external APIs. <strong><a href="https://www.confluent.io/blog/introducing-confluent-platform-7-7/">&#8212;&gt; Read more</a></strong></p>]]></content:encoded></item><item><title><![CDATA[DLD #1 | Data Landscape Digest 🗞️]]></title><description><![CDATA[Curated Knowledge on Data Engineering Landscape]]></description><link>https://www.pracdata.io/p/dld-1-data-landscape-digest</link><guid isPermaLink="false">https://www.pracdata.io/p/dld-1-data-landscape-digest</guid><dc:creator><![CDATA[Alireza Sadeghi]]></dc:creator><pubDate>Sun, 25 Aug 2024 12:26:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KLRn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KLRn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KLRn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 
848w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KLRn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg" width="1456" height="1020" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KLRn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!KLRn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KLRn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ae25d2-f682-4a9c-8534-27d46ed7b084_1536x1076.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><h2>Introduction</h2><p>Welcome to the <strong>Data Landscape Digest (DLD)</strong> newsletter series. 
This will be a periodic roundup of the latest and greatest in the world of data, and data engineering in particular. We'll deliver a curated selection of the finest news articles, blog posts, tutorials, discussions, and more within the data landscape.</p><div><hr></div><h2>&#10024; Featured</h2><div><hr></div><p>In a recent <strong><a href="https://practicaldataengineering.substack.com/i/147610179/non-open-vs-open-data-lakehouse">blog post</a></strong> I explored the difference between generic <strong>data lakehouses</strong> offered by some vendors, and <strong>open data lakehouses</strong> built on the foundation of open source tools and technologies. The Apache Hudi blog published a post providing an overview of the <strong>open data lakehouse architecture</strong>. It details the architecture layers, components, key technologies, advantages over previous data lake architectures, and use cases for implementing an open data lakehouse.  <strong><a href="https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/">&#8212;&gt; Read more</a></strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NMyZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NMyZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 424w, https://substackcdn.com/image/fetch/$s_!NMyZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 848w, 
https://substackcdn.com/image/fetch/$s_!NMyZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 1272w, https://substackcdn.com/image/fetch/$s_!NMyZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NMyZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png" width="1456" height="789" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:789,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;/assets/images/blog/dlh_new.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="/assets/images/blog/dlh_new.png" title="/assets/images/blog/dlh_new.png" srcset="https://substackcdn.com/image/fetch/$s_!NMyZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 424w, https://substackcdn.com/image/fetch/$s_!NMyZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 848w, 
https://substackcdn.com/image/fetch/$s_!NMyZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 1272w, https://substackcdn.com/image/fetch/$s_!NMyZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779b535b-e9bc-4021-8891-1436a2489507_2323x1259.png 1456w" sizes="100vw"></picture></div></a></figure></div><div><hr></div><h2>&#128225; Open Source News</h2><div><hr></div><h4>&#128073;  Debezium Latest Official Release</h4><p><strong>Debezium 
2.7.0.Final</strong> has been released. Some 140 outstanding issues have been fixed, along with many new features and improvements in the core component as well as the stand-alone connectors.  <strong><a href="https://debezium.io/blog/2024/07/01/debezium-2-7-final-released/">&#8212;&gt; Read more</a></strong></p><h4>&#128073;  Submission of OneTable to Apache Foundation</h4><p><strong>OneHouse</strong>'s submission of <strong>XTable</strong> (formerly known as OneTable) to the Apache Software Foundation Incubator was a major recent announcement. Top cloud vendors such as Microsoft and Google are already integrating XTable into their analytics platforms, <strong>Microsoft Fabric</strong> and <strong>BigLake</strong> respectively, to provide a unified logical lakehouse with interoperability between different open table formats. <strong><a href="https://www.onehouse.ai/blog/open-data-foundations-with-apache-xtable-hudi-delta-and-iceberg-interoperability">&#8212;&gt; Read more</a></strong></p><div><hr></div><h2>&#128736; Practical Data Engineering</h2><div><hr></div><h4>&#128073; Spark Repartition() Function</h4><p>Spark pipelines often employ the <code>df.repartition()</code> function to optimise data processing, especially by consolidating small partitions before loading data into target storage. It's essential to remember that a partition in Spark is a <em>unit of parallelism</em> and data distribution used by the distributed compute engine, so repartitioning is not necessarily a full bucketing or SQL-style <code>group by</code> operation. A blog post explains how Spark repartitioning works and what it actually does. <strong><a href="https://python.plainenglish.io/the-truth-about-pysparks-repartition-prepare-to-be-surprised-4dede792f3f4">--&gt; Read more</a></strong></p><h4>&#128073; Apache XTable + Airflow</h4><p><strong>AWS</strong> published a blog post demonstrating how to use <strong>Apache XTable</strong> to convert open table format metadata to other formats. 
The blog post features a custom Airflow operator, <code>XtableOperator()</code>, designed for batch pipeline translations on the AWS platform. The operator's code is available on <strong><a href="https://github.com/aws-samples/apache-xtable-on-aws-samples">GitHub</a></strong>. This development suggests that <em>unified open table format</em> adoption is gaining some momentum. <strong><a href="https://aws.amazon.com/blogs/big-data/run-apache-xtable-on-amazon-mwaa-to-translate-open-table-formats/">&#8212;&gt; Read more</a></strong></p><div><hr></div><h2>&#9881;&#65039; Technical Deep Dive</h2><div><hr></div><h4>&#128073;  Apache Paimon&#8217;s Internal Design</h4><p>While the 'big three' open table formats &#8211; Apache Hudi, Iceberg, and Delta Lake &#8211; dominate the market and discussions, <strong>Apache Paimon</strong>, a more recent &#8220;<strong>Flink table format</strong>&#8221;, has received less attention. If you're interested in learning more about Apache Paimon, a comprehensive blog post by Giannis delves into its design goals, internals, and key features. <strong><a href="https://medium.com/@ipolyzos_/the-majesty-of-apache-flink-and-paimon-d36e73571fc9">&#8212;&gt; Read more</a></strong></p><p></p><h4> &#128073;  How dlt Works Under the Hood</h4><p>Discussions have been ongoing regarding the potential use of new open source data ingestion tools like <strong>Data Load Tool (dlt)</strong> as a replacement for more established ones like <strong>Airbyte</strong> in certain use cases (e.g. API data integration). A dlt project contributor has published a blog post detailing the internal data pipeline design and core functions of dlt, including data extraction, normalisation, and loading. The latest version leverages Apache Arrow's efficient in-memory data structure to optimise the entire pipeline. 
<strong><a href="https://dlthub.com/blog/how-dlt-uses-apache-arrow">&#8212;&gt; Read more</a></strong></p><p></p><h4>&#128073;  Yet Another Kafka Explanation</h4><p>There's a wealth of resources available explaining Kafka's architecture and internals. I found a recent blog post series on the topic, which provides clear and concise explanations of the concepts, accompanied by helpful visuals for those unfamiliar with Kafka's design and operation.</p><p><strong><a href="https://blog.det.life/apache-kafka-overview-b04c4ab8ef49">Kafka Architecture Overview</a></strong> | <strong><a href="https://blog.det.life/apache-kafka-important-designs-2a0e6aa6c5bf">Design elements</a></strong> | <strong><a href="https://blog.det.life/apache-kafka-producer-db3b177f65d2">Kafka Producer</a></strong> | <strong><a href="https://blog.det.life/apache-kafka-consumer-d902e3589679">Kafka Consumer</a></strong></p><p></p><h4>&#128073; DuckDB's Internal Memory and Buffer Management</h4><p>If you've used <strong>DuckDB</strong> or are exploring its capabilities, you might wonder how it handles datasets larger than memory, a common stumbling block for some Python dataframe libraries like Pandas. DuckDB&#8217;s official blog has recently covered the engine's internal memory and buffer management. It explains how DuckDB leverages streaming execution to process queries without fully loading CSV or Parquet files into memory, and utilises disk spilling when intermediate results exceed memory capacity. <strong><a href="https://duckdb.org/2024/07/09/memory-management.html">&#8212;&gt; Read more</a></strong></p><p></p><h4>&#128073;  Snowflake's Micro Partitioning Internal Design</h4><p>If you've been using <strong>Snowflake</strong> at your company, you're likely familiar with its internal partitioning feature called <em><strong><a href="https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions">micro-partitioning</a></strong></em>. 
This process automatically divides tables into micro-partitions of between 50 MB and 500 MB of uncompressed data, organising data in a columnar format within each micro-partition. A concise and excellent blog post provides a clear explanation of micro-partitioning's internal design, complete with helpful visuals. <strong><a href="https://medium.com/@saisuman.singamsetty/snowflake-micro-partitions-the-future-of-data-storage-and-retrieval-4e41312708b4">&#8212;&gt; Read more</a></strong></p><p></p><h4> &#128073; Overview of Kafka's Tiered Storage Design</h4><p><strong>Kafka 3.6</strong>, released in 2023, introduced a highly anticipated feature: <strong>Tiered Storage</strong>. This feature currently supports <em>local</em> and <em>remote</em> storage tiers, enabling the movement of inactive segments to a configurable deep storage solution like HDFS or S3 based on local retention settings. This provides a cost-effective and scalable way to retain historical data. <strong>Uber</strong>, credited with driving the tiered storage proposal <strong><a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage">[KIP-405]</a></strong>, discusses the internals of the tiered storage architecture in a blog post. <strong><a href="https://www.uber.com/en-AU/blog/kafka-tiered-storage/">&#8212;&gt; Read more</a></strong></p><div><hr></div><h2>&#128270; Case Studies</h2><div><hr></div><h4> &#128073;  Implementation of dlt in Production</h4><p>It's inspiring to hear about individuals and teams embracing new open-source tools and technologies in production environments. <strong>Data Load Tool (dlt)</strong>, a relatively new ETL library covered earlier, has been adopted by some for production workloads. Alexander from Dataops explores the advantages and disadvantages of using dlt compared to more established data integration tools like Airbyte. 
<strong><a href="https://medium.com/dataops-tech/data-load-tool-dlt-pros-cons-and-integration-into-data-platform-as-an-ingest-tool-eb34311a2007">&#8212;&gt; Read more</a></strong></p><h4>&#128073;  Notion's Data Architecture Evolution</h4><p><strong>Notion</strong> has unveiled their new data lakehouse architecture. They chose Apache Hudi as their table format due to its efficient incremental data ingestion capabilities, making it suitable for their update-heavy workloads. The architecture also incorporates event-based CDC ingestion using Debezium and Kafka. <strong><a href="https://www.notion.so/blog/building-and-scaling-notions-data-lake">&#8212;&gt; Read more</a></strong></p><div><hr></div><h2>&#128172; Community Discussions</h2><div><hr></div><p>&#128073; This <strong><a href="https://www.reddit.com/r/dataengineering/comments/1e7fcmx/what_i_would_do_if_had_to_relearn_data/">career advice</a></strong> from a senior data engineer highlights a key point in one of Reddit's highest-rated data engineering discussions in July: </p><div class="pullquote"><p><strong>Master the data engineering fundamentals first!</strong></p></div><p>While flashy tools and platforms come and go, a strong foundation in low-level skills like Bash, Git, SQL, pure Python development, and containerisation will take you much further. Juniors who prioritise these foundational skills before diving into advanced tools and stacks will be better positioned for success.</p><div><hr></div><h2> &#127909; Conferences &amp; Events</h2><div><hr></div><p>&#128073;  The annual virtual <strong>PrestoCon Day 2024</strong>, organised by the Linux Foundation and the Presto Foundation, took place in June, covering topics like the Presto 2.0 native C++ engine and Presto usage at companies like Uber. A recap of the event and the main sessions is provided in this <strong><a href="https://prestodb.io/blog/2024/07/02/recap-of-prestocon-day-2024-presto-c-performance-new-connectors-use-cases-and-so-much-more/">article</a></strong>. 
All 24 recorded sessions can be found on <strong><a href="https://www.youtube.com/playlist?list=PLJVeO1NMmyqUUj2UbRiwX8-Pmc7RNwDcY">YouTube</a></strong>.</p>]]></content:encoded></item></channel></rss>