DLD #3 | Data Landscape Digest 🗞️
Open Catalog War, Latest Apache Kafka and Apache Flink Releases, Airflow Trigger Rules, Lakehouse File Formats and More.
✨ Featured - The New Open Catalog War
In July this year, Snowflake open-sourced its Polaris Catalog under the Apache 2.0 license, with plans to submit it to the Apache Incubator program. Polaris is a catalog service designed for Apache Iceberg, though it can be extended to other major open table formats as well.
The question is: why does Apache Iceberg, which already has its own metadata layer, need a separate catalog service?
While the Delta Lake, Apache Hudi, and Apache Iceberg open table formats provide their own metadata layers, each query engine (such as Spark, Flink, Presto, or Trino) must implement its own integration for tasks like schema discovery and data operations.
An open unified catalog service like Polaris simplifies this by streamlining multi-engine interoperability. It also offers enhanced features like improved search, data discovery, tagging, and governance, including access control through a unified, open, and vendor-agnostic interface compatible with various storage engines.
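To make the interoperability benefit concrete, here is a minimal sketch of pointing Spark at an Iceberg REST catalog endpoint, the interface Polaris implements. The catalog name, URI, warehouse, and table below are illustrative placeholders, not values from the Polaris docs.

```python
# Hypothetical sketch: a Spark session wired to an Iceberg REST catalog.
# Assumes the iceberg-spark-runtime package is on the classpath; the catalog
# name, endpoint URI, warehouse, and table are all placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("rest-catalog-demo")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "http://localhost:8181/api/catalog")
    .config("spark.sql.catalog.lakehouse.warehouse", "demo_warehouse")
    .getOrCreate()
)

# Any engine that speaks the same REST protocol discovers the same tables,
# which is exactly the multi-engine interoperability described above.
spark.sql("SELECT * FROM lakehouse.db.events LIMIT 10").show()
```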
Polaris has the potential to become the standard catalog service for data lakehouse platforms, much like Hive Metastore was for Hadoop-based systems. Currently, catalog options include proprietary tools like AWS Glue Catalog and Databricks' Unity Catalog, which was also open-sourced in June 2024.
Snowflake's decision to open-source Polaris may have been influenced by Databricks' move to open-source Unity Catalog. While some may see this as a new "catalog war" driven by marketing strategies, as noted by Chris, there's still hope that these moves will lead to production-ready, open-source catalog services that can finally provide an alternative to Hive Metastore after all these years.
💡 Open Source News
🔗 Seamless Integration of dbt and Airflow
Building and scheduling data pipelines using dbt models and workflow orchestration tools like Airflow has become standard practice in data engineering for running transformation workflows. In response to the growing demand for seamless integration between the two systems, Astronomer developed a Python package called Cosmos, which simplifies running dbt models within Airflow DAGs through a new DAG type, DbtDag. The latest version, 1.5.1, was released on July 17, so if you're using dbt and Airflow, it's definitely worth checking out. → Read More
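For a feel of the API, here is a minimal sketch of a Cosmos DbtDag; the project path, profile details, and connection ID are placeholders for your own setup.

```python
# A minimal Cosmos DbtDag sketch (paths, profile, and conn_id are placeholders).
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

profile_config = ProfileConfig(
    profile_name="analytics",
    target_name="dev",
    # Reuses an existing Airflow connection instead of a profiles.yml file.
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="postgres_default",
        profile_args={"schema": "public"},
    ),
)

# Renders every model in the dbt project as its own Airflow task.
jaffle_shop = DbtDag(
    dag_id="jaffle_shop_dbt",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/jaffle_shop"),
    profile_config=profile_config,
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```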
🔗 What's New in Apache Kafka 3.8.0
The release of Apache Kafka 3.8.0 was recently announced, bringing several new features and improvements. This post on the official Apache Kafka site provides an overview of key updates, including support for compression levels, a new consumer rebalance protocol, and re-bootstrapping capabilities. → Read More
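As a rough illustration of two of those updates, the new settings look like this; the property names follow KIP-390 (per-codec compression levels) and KIP-848 (the new rebalance protocol), shown here as plain config dicts rather than any particular client library's API.

```python
# Hypothetical sketch of Kafka 3.8-era client settings as plain config dicts.
producer_config = {
    "bootstrap.servers": "localhost:9092",
    "compression.type": "zstd",
    # New in 3.8 (KIP-390): tune the trade-off between CPU cost and compression ratio.
    "compression.zstd.level": 10,
}

consumer_config = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-app",
    # Opt in to the next-generation consumer rebalance protocol (KIP-848).
    "group.protocol": "consumer",
}
```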
🔗 Apache Flink 1.20: New Features and the Road to Flink 2.0
Apache Flink 1.20 was also recently released, and the Confluent blog highlights the major improvements and features in this update. Notable enhancements include improvements to the bucketing feature for Flink SQL tables, allowing users to specify the number of buckets in the DISTRIBUTED BY clause, and the introduction of Flink SQL materialised tables, which are automatically refreshed in the background as data streams in. Additionally, there are various operational improvements. According to reports, this may be the last minor release before Flink 2.0. → Read More
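Here is a rough PyFlink sketch of both features, assuming Flink 1.20; the table definitions and connector options are placeholders, and each feature requires a connector or catalog that actually supports it.

```python
# Illustrative Flink 1.20 SQL driven through PyFlink; all names are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Bucketing (FLIP-376): DISTRIBUTED BY fixes the bucket key and bucket count.
# The '...' stands in for a connector that supports bucketing.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id BIGINT,
        amount   DECIMAL(10, 2)
    ) DISTRIBUTED BY HASH(order_id) INTO 4 BUCKETS
    WITH ('connector' = '...')
""")

# Materialised tables (FLIP-435): Flink refreshes the result in the background
# at the declared freshness; needs a catalog that supports materialised tables.
t_env.execute_sql("""
    CREATE MATERIALIZED TABLE order_totals
    FRESHNESS = INTERVAL '1' MINUTE
    AS SELECT order_id, SUM(amount) AS total FROM orders GROUP BY order_id
""")
```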
📚 Practical Data Engineering
🔗 Mastering Airflow Trigger Rules
Astronomer, a managed Airflow provider, has published an overview of Airflow trigger rules with a visual guide to help new Airflow developers understand and apply the right trigger rules in their DAGs. For new engineers, grasping all the trigger rules can be challenging, but it's a crucial aspect of effectively managing dependencies between upstream and downstream tasks. → Read More
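As a quick illustration, here is a minimal DAG where the reporting task keeps the default all_success rule while a cleanup task uses all_done, so it runs even when an upstream load fails; the task names are illustrative.

```python
# Minimal trigger-rule demo: `report` needs every upstream to succeed (the
# default), while `cleanup` runs once upstreams finish, regardless of outcome.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="trigger_rule_demo",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    extract = EmptyOperator(task_id="extract")
    load_a = EmptyOperator(task_id="load_a")
    load_b = EmptyOperator(task_id="load_b")
    report = EmptyOperator(task_id="report")  # default: TriggerRule.ALL_SUCCESS
    cleanup = EmptyOperator(
        task_id="cleanup",
        trigger_rule=TriggerRule.ALL_DONE,  # run even if a load task failed
    )

    extract >> [load_a, load_b] >> report
    [load_a, load_b] >> cleanup
```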
🔗 A Radical Simplicity Approach to Data Engineering
The Towards Data Science blog recently shared a great post about the trade-off between simplicity and functionality in software projects, including data engineering. The author advocates a philosophy of "Radical Simplicity", where simple, straightforward solutions are prioritised over complex ones. This resonates with me, as I believe complexity should only be introduced when absolutely necessary. → Read More
🔗 Hands-On Guide: Installing and Integrating Polaris OSS
If you're interested in installing and testing the latest Apache Polaris open-source release, Dremio has published a hands-on tutorial that guides you through the installation process and integration with Apache Spark and Apache Iceberg. → Read More
⚙️ Technical Deep Dive
🔗 Parquet vs ORC: Choosing the Right Format for the Data Lakehouse
In a recent Apache Hudi blog post, the author compares Parquet and ORC, two of the most popular serialisation frameworks for data lakes and open table formats. The post argues that Parquet delivers better performance for read-heavy, complex analytical use cases where query performance is crucial, while ORC offers a more balanced approach for both read and write performance with superior compression, making it a better fit for general-purpose data storage in Hudi. → Read More
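If you want to feel out the trade-off yourself, the two formats are one method call apart in Spark; here is a trivial sketch with placeholder data and paths.

```python
# Write the same dataset as Parquet and ORC to compare sizes and scan times.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-vs-orc").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "event_id")

df.write.mode("overwrite").parquet("/tmp/events_parquet")
df.write.mode("overwrite").orc("/tmp/events_orc")
```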
🔗 Overview of Amazon MSK Tiered Storage
Amazon recently published a post explaining how the new Kafka tiered storage in Amazon MSK (Managed Streaming for Apache Kafka) enhances scalability and resiliency. With the new decoupled storage and compute architecture, the system benefits from faster broker recovery, improved load balancing, and virtually unlimited scalability. → Read More
💬 Community Discussions
Several widely discussed threads (link and link) appeared on Reddit last month about data engineering roles and the challenges of applying for data engineering jobs. Judging by comments from both candidates and hiring managers (including CTOs), the situation seems two-fold.
On one hand, the market is flooded with unqualified candidates, often with minimal skills picked up from low-quality bootcamps and online courses, making it hard for companies to find qualified engineers. On the other hand, vague and unclear job descriptions leave data engineers unsure of what is expected of them once hired. This has created frustration on both sides.
📝 Case Studies
🔗 Evolution of Apache Flink Architecture at Airbnb
Airbnb published a post detailing the evolution of their Apache Flink architecture. Initially, they deployed Flink jobs on Hadoop YARN with Airflow as the scheduler in 2018. Today, they've moved to deploying Flink jobs on Kubernetes, eliminating the need for a job scheduler. → Read More
🔗 Pinterest's Adoption of StarRocks for Real-Time Analytics
Pinterest shared their recent adoption of StarRocks, a real-time OLAP engine, for their real-time analytics platform. They chose StarRocks for its features like support for standard SQL, joins, sub-queries, and materialised views, capabilities not readily available in other real-time OLAP engines like Druid. Back in 2021, Pinterest published details about managing a large Druid fleet with 2,000 nodes in a multi-cluster setup. → Read More
🔗 The Rise of New Real-time OLAP Engines
On that note, while Apache Druid, Pinot, and ClickHouse have dominated the open-source real-time OLAP space in recent years, we're now seeing increased adoption of newer engines like Apache Doris and StarRocks, the latter being a fork of Doris. For a detailed comparison between StarRocks and Doris, check out this blog post from StarRocks Engineering.
🔗 Uber's Hadoop-to-Cloud Migration
Uber, which operates one of the largest on-premises Hadoop clusters, has recently begun migrating to the cloud, starting with a key architectural shift: replacing the HDFS file system with Google Cloud Storage, while still running the rest of their stack on IaaS. One of the challenges in migrating from Hadoop to the cloud is transitioning Hadoop's security features, such as delegation tokens and Kerberos authentication, to Google Cloud's token-based security. Uber discusses how they tackled these security migration challenges in this article. → Read More
📣 Vendor News & Announcements
🔗 Fivetran Integration with Snowflake's Polaris Catalog
Just days after Snowflake open-sourced the Polaris catalog service on GitHub, Fivetran, a leading SaaS provider for data integration, announced its upcoming integration with the newly open-sourced Polaris data catalog. This integration aims to develop a managed catalog solution for Fivetran's Managed Data Lake Service. → Read More
🔗 Databricks LakeFlow Connect for Automated Data Ingestion
Databricks announced the public preview of LakeFlow Connect, an automated incremental data ingestion service for sources like SQL Server and Salesforce. Built on Delta Live Tables, LakeFlow Connect enables incremental data ingestion using CDC (Change Data Capture). This marks another step by major vendors toward automating data engineering tasks. → Read More
🔗 Databricks Lakehouse Federation Across AWS, Azure, and GCP
Databricks also announced the general availability of Lakehouse Federation in Unity Catalog across AWS, Azure, and GCP last month. This mirrors the strategy of other top cloud vendors to offer a unified analytical platform with centralised data discovery and governance, providing a unified view of enterprise data across multiple storage engines and cloud platforms. → Read More
🔗 ClickHouse Acquisition of PeerDB for Real-Time Postgres Ingestion
ClickHouse, Inc. announced the acquisition of PeerDB, a provider of Change Data Capture (CDC) for Postgres databases. This move aims to integrate and streamline real-time data ingestion from transactional databases like Postgres into the ClickHouse OLAP engine. → Read More
🎥 Conferences & Events
🔗 Current 2024 Keynotes
The two-day Current 2024 event (formerly Kafka Summit), organised by Confluent, took place last week in Austin. Keynotes from both Day 1 and Day 2 have already been published on YouTube: Keynote Day 1 | Keynote Day 2
🔗 Open Source Data Summit Virtual Conference
The Open Source Data Summit Virtual Conference will be held on October 2nd. If you're interested, you can register for free at opensourcedatasummit.com. The event will feature numerous discussions on data lakehouses and the role of open table formats in modern data architectures.