DLD #5 | Data Landscape Digest 🗞️
S3 Table Abstraction, Airflow's data intervals, Confluent TableFlow, DuckDB Lakehouse analytics, Google Next Keynotes and more.
✨ Featured - Table Buckets as a New Lakehouse Abstraction
Since AWS announced S3 Tables last December, many industry experts have shared their opinions through blogs and social media. Most of the feedback I've read so far has been somewhat negative. Critics like Daniel argue that S3 Tables create vendor lock-in. They see it as a proprietary solution that limits the flexibility and openness required by true Open Lakehouse architectures.
Critics also highlight that S3 Tables mainly integrate with AWS services such as Glue, Athena, and EMR, lacking sufficient support for general-purpose tools and services.
Despite these criticisms, broader support for S3 Tables is emerging. PyIceberg now supports accessing S3 Tables through Glue Catalog, and preview support for querying S3 Tables using Iceberg REST API endpoints, such as Amazon SageMaker Lakehouse REST endpoints, has been added to DuckDB.
In this context, Werner Vogels, Amazon's CTO, recently published another excellent article on his All Things Distributed blog. He reflects on the 19-year evolution of Amazon S3, from simple object storage to a comprehensive data solution. Vogels emphasises how customer feedback has shaped the development of S3's major capabilities.
Vogels further explains that S3's latest innovation, S3 Tables, directly addresses customer feedback and pain points experienced when working with open table formats on the S3 object store. Its main goal is to enhance lakehouse data management by resolving these challenges. Vogels also discusses the ongoing tension between simplicity and velocity, a common trade-off faced in product development. I highly recommend reading his article.
💡 Open Source News
📌 Apache Flink 2.0.0 Release
Apache Flink 2.0.0 has been officially released, marking the first major Flink release since version 1.x launched nine years ago. Key innovations include disaggregated state management, materialised tables for unified stream-batch processing, and enhanced SQL capabilities. Check out the full highlights of new features and improvements. --> Read More
📌 Apache Kafka 4.0 Release
Apache Kafka 4.0 is officially out, marking a significant milestone by removing the dependency on Apache ZooKeeper. Key features highlighted by the Confluent blog include a new consumer group protocol for improved rebalance performance, the introduction of Queues for Kafka to support traditional queue semantics, updated Java version requirements, and various enhancements through Kafka Improvement Proposals (KIPs). --> Read More
📌 DuckDB Now Has Its Own Web UI!
Starting with DuckDB v1.2.1, a local web user interface (UI) for DuckDB, developed in collaboration with MotherDuck, is available, aimed at enhancing the user experience by simplifying database interactions. Using the new UI, you can execute SQL queries through interactive notebooks and explore databases and columns with advanced features like syntax highlighting and autocomplete. --> Read More
🛠 Practical Data Engineering
📌 Implementation of Medallion Architecture Using ClickHouse
A great hands-on example of implementing a medallion architecture using ClickHouse to process and analyse free data from the Bluesky social network. It highlights common challenges such as data duplication, malformed JSON, and inconsistent structures. It also demonstrates how ClickHouse features, including the JSON data type and the ReplacingMergeTree engine, effectively address these issues. --> Read More
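Conceptually, ReplacingMergeTree deduplicates at merge time by keeping, for each sorting key, only the row with the highest version column. A minimal Python sketch of that semantic (this is not ClickHouse code, and note that in ClickHouse the deduplication happens asynchronously during background merges):

```python
def replacing_merge(rows, key_fields, version_field):
    """Emulate ReplacingMergeTree's merge-time dedup: for each sorting key,
    keep only the row with the highest version (later rows win on ties)."""
    latest = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in latest or row[version_field] >= latest[key][version_field]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"id": 1, "text": "first draft", "version": 1},
    {"id": 1, "text": "edited post", "version": 2},  # same key, newer version
    {"id": 2, "text": "hello",       "version": 1},
]
merged = replacing_merge(rows, key_fields=["id"], version_field="version")
print(sorted((r["id"], r["text"]) for r in merged))
# [(1, 'edited post'), (2, 'hello')]
```

This is why duplicated Bluesky events with the same key collapse to a single, latest row once merges run.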
📌 DuckDB's Enhanced Lakehouse Analytics
Querying external Delta Lake tables from DuckDB has become significantly easier with recent improvements in the Delta extension. You can now attach to Delta tables and query them using aliases in a simpler and cleaner manner. Additionally, data-skipping enhancements accelerate federated queries over open table formats, demonstrating DuckDB's commitment to advancing federated analytics capabilities over lakehouse table formats. --> Read More
📌 21 Reasons to Consider Apache Hudi!
While Apache Iceberg has emerged as the leading open table format and continues strong into 2025, Apache Hudi is making its case by highlighting 21 unique reasons to consider Hudi over Iceberg and Delta Lake. If you're currently evaluating open table formats for your next project or company, this comparison is worth checking out. --> Read More
⚙️ Technical Deep Dive
📌 Parquet Data Skipping Mechanisms
This concise article provides an excellent overview of how pruning and data skipping are typically performed on Parquet files using metadata and statistics at various levels, including row groups and pages. The described approach aligns with DataFusion's implementation of the Parquet reading and pruning pipeline, but the general design principles apply broadly to other Parquet readers as well. --> Read More
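To make the row-group pruning idea concrete, here is a pure-Python sketch of skipping row groups using min/max statistics (the `RowGroupStats` type and function names are invented for illustration; they are not DataFusion's or any Parquet reader's actual API):

```python
from dataclasses import dataclass

@dataclass
class RowGroupStats:
    """Min/max statistics for one column in one Parquet row group."""
    min_value: int
    max_value: int

def prune_row_groups(stats, lower=None, upper=None):
    """Return indices of row groups that may contain rows matching
    lower <= value <= upper; all other groups can be skipped unread."""
    keep = []
    for i, s in enumerate(stats):
        if lower is not None and s.max_value < lower:
            continue  # every value in this group is below the predicate range
        if upper is not None and s.min_value > upper:
            continue  # every value in this group is above the range
        keep.append(i)
    return keep

groups = [RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(200, 299)]
print(prune_row_groups(groups, lower=150))  # [1, 2]
```

The same min/max containment test applies recursively at page level, which is what makes Parquet's hierarchical statistics so effective for data skipping.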
📌 What's the Difference Between Arrow Flight, ADBC, and Arrow IPC?
Still confused about the differences between Arrow Flight Protocol, ADBC, and Arrow IPC? This article clearly explains these key technologies within the Apache Arrow ecosystem. It also highlights the advantages of using Apache Arrow as a data interchange format, demonstrating how it enables faster and more efficient data exchange compared to traditional methods. --> Read More
📌 Airflow Data Intervals: A Deep Dive
Understanding Apache Airflow's concept of time and intervals, such as start time, execution time, and logical dates, can be challenging. This article dives deep into the significance of data intervals in Airflow, clearly explaining their critical role in effectively scheduling and executing workflows. It illustrates how data intervals ensure each DAG run processes complete and accurate datasets, thus promoting idempotency and enabling reliable backfilling of historical data. --> Read More
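The core idea can be sketched in a few lines of Python (the function name is illustrative, not Airflow's internal API): for a daily schedule, the run labelled with logical date D covers the interval [D, D+1 day) and is only triggered once that interval has fully closed, so the data it processes is complete.

```python
from datetime import datetime, timedelta

def daily_data_interval(logical_date: datetime):
    """For a daily schedule, the run identified by `logical_date` covers
    [logical_date, logical_date + 1 day) and is triggered only after that
    interval has fully elapsed, so the processed dataset is complete."""
    start = logical_date
    end = logical_date + timedelta(days=1)
    run_after = end  # earliest time the scheduler will actually start this run
    return start, end, run_after

start, end, run_after = daily_data_interval(datetime(2025, 4, 1))
print(start, end, run_after)
# 2025-04-01 00:00:00 2025-04-02 00:00:00 2025-04-02 00:00:00
```

This is also why a backfill for a past logical date reprocesses exactly the same interval, which keeps reruns idempotent.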
📚 Academic Papers
📌 OLAP DBMS Archetype For The Next Ten Years?
An insightful paper co-authored by Michael Stonebraker and Andy Pavlo provides a comprehensive overview of the evolution of database management systems (DBMSs) since 2005. The authors predict the lakehouse architecture as the "OLAP DBMS archetype" for the coming decade, offering a unified infrastructure capable of supporting both SQL and non-SQL workloads, a vision initially conceived but not fully realised during the Hadoop era. --> Read More
💼 Data Engineering Career
📌 How Will AI Disrupt Data Engineering?
This is one of the most compelling questions facing software and data engineers today. Tristan Handy, founder and CEO of dbt Labs, shares insightful perspectives on how artificial intelligence will impact data engineering roles and what the future might hold for data engineers. --> Read More
💪 Skill Up
📌 Learning ClickHouse Fundamentals
ClickHouse is offering free 3-hour training sessions on ClickHouse fundamentals scheduled for April 22 and May 15. If you're interested, be sure to sign up! --> Read More
📣 Vendors News & Announcements
📌 General Availability of Confluent's TableFlow
Confluent has announced the general availability of TableFlow, a technology that embraces stream-table duality. TableFlow simplifies the ingestion of event data from Kafka topics into structured lakehouse tables (currently supporting Iceberg and Delta Lake), abstracting away complex data engineering tasks. It manages the entire ingestion lifecycle, including schema evolution and table maintenance on Confluent Platform. --> Read More
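To make stream-table duality concrete, here is a minimal, vendor-neutral Python sketch (not Confluent's API; all names are invented) that folds a changelog stream of key/value events into a table, with `None` acting as a delete tombstone:

```python
def stream_to_table(changelog):
    """Materialise a table as the fold of a changelog stream:
    the latest value per key wins; a None value deletes the key."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)  # tombstone record: remove the key
        else:
            table[key] = value    # upsert: latest event wins
    return table

events = [("user1", "active"), ("user2", "active"),
          ("user1", "inactive"), ("user2", None)]
print(stream_to_table(events))  # {'user1': 'inactive'}
```

The reverse direction also holds: a table's change history is itself a stream, which is the duality that lets a Kafka topic be continuously materialised as an Iceberg or Delta Lake table.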
📌 Materialize Self-Managed & Free Community Edition
The Materialize streaming database service provider has introduced two new offerings: a Self-Managed version and a Free Community Edition. The Self-Managed option allows deployment of Materialize in your own environment, providing greater control over performance and compliance. The Community Edition grants free access to Materialize's powerful features with specific usage limits, making it easier to test or run small-scale production workloads. --> Read More
📊 Case Studies
📌 Streaming Data Ingestion Into Cloud Data Warehouse
This insightful case study by Canva compares different data ingestion approaches into cloud data warehouses. It contrasts their previous micro-batch, file-based ingestion method (using AWS services like Firehose) with a new architecture using Snowflake's managed Snowpipe Ingestion service. The new solution achieves low-latency ingestion of billions of events daily, significantly reducing query latency to under 10 minutes and lowering overall cloud costs. --> Read More
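To see why file-based micro-batching puts a floor under data latency, here is a toy Python sketch (all names are invented; this is not Canva's, Firehose's, or Snowpipe's implementation) of a buffer that flushes a "file" downstream when either a size or an age threshold is reached:

```python
class MicroBatchBuffer:
    """Toy model of file-based micro-batch ingestion: events are buffered and
    flushed as one file when a size or age threshold is hit, so end-to-end
    data latency is bounded below by the flush interval."""
    def __init__(self, max_events=3, max_age_s=60.0):
        self.max_events = max_events
        self.max_age_s = max_age_s
        self.buffer = []
        self.opened_at = None

    def add(self, event, now):
        """Buffer an event; return the flushed batch if a threshold was hit."""
        if not self.buffer:
            self.opened_at = now  # start the age clock for a new batch
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_events
                or now - self.opened_at >= self.max_age_s):
            flushed, self.buffer = self.buffer, []
            return flushed  # one "file" written downstream
        return None

buf = MicroBatchBuffer(max_events=3, max_age_s=60.0)
assert buf.add("e1", now=0.0) is None
assert buf.add("e2", now=1.0) is None
print(buf.add("e3", now=2.0))  # ['e1', 'e2', 'e3'] -- size threshold reached
```

Row-level streaming ingestion removes this buffering stage, which is the source of the latency improvement the case study reports.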
📌 Using Key-Value Stores for Exactly-Once Streaming Ingestion
Event deduplication and exactly-once processing guarantees are crucial challenges in building reliable streaming pipelines, often complicated by pipeline failures and network issues. One popular solution involves using an external, fast key-value store to track and eliminate duplicate events during streaming. MyHeritage shares their experience implementing exactly-once processing using Spark Structured Streaming and a key-value store for deduplication. --> Read More
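A minimal sketch of this pattern, using an in-memory dict in place of a real external key-value store such as Redis (the class and method names are invented for illustration; production systems would also need atomic check-and-set and a persistent store to survive restarts):

```python
import time

class DedupStore:
    """In-memory stand-in for a fast external key-value store used to drop
    duplicate events; real deployments set a TTL so keys eventually expire."""
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._seen = {}  # event_id -> insertion timestamp

    def first_time_seen(self, event_id, now=None):
        now = time.time() if now is None else now
        # evict expired entries so the store stays bounded
        self._seen = {k: t for k, t in self._seen.items() if now - t < self.ttl}
        if event_id in self._seen:
            return False  # duplicate within the TTL window: drop it
        self._seen[event_id] = now
        return True

def process_stream(events, store):
    """Keep only the first occurrence of each event id within the TTL window."""
    return [e for e in events if store.first_time_seen(e["id"])]

events = [{"id": "a"}, {"id": "b"}, {"id": "a"}]  # "a" is a retried duplicate
print([e["id"] for e in process_stream(events, DedupStore())])  # ['a', 'b']
```

Combined with idempotent or transactional writes to the sink, this dedup step is what upgrades at-least-once delivery to effectively-exactly-once processing.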
🎥 Conferences & Events
📌 Google Cloud Next '25 Keynotes
Google Cloud Next '25 took place earlier this month, and both the opening keynote and developer keynote are now available on YouTube. The conference again focused heavily on AI, highlighting Google's Gemini LLM foundation model, AI agents, the Agent Development Kit, and new features in Vertex AI. On the analytics front, the new capabilities of Google's Data Science Agent look particularly exciting. Be sure to check out these keynotes to stay updated on the latest trends and technologies from Google Cloud.
Opening Keynote:
Developer Keynote: