DLD #2 | Data Landscape Digest 🗞️
Curated Knowledge on the Data Engineering Landscape
✨ Featured - Netflix Maestro Workflow Engine
The crowded field of open-source workflow orchestration engines has just welcomed a new player: Maestro, recently open-sourced by Netflix!
Netflix asserts that Maestro is a highly scalable and flexible scheduler capable of managing large-scale heterogeneous workflows, including ML training and data pipelines. It supports flexible execution logic, such as Docker images and notebooks, and accommodates various workflow patterns, including both acyclic graphs (DAGs) and cyclic graphs.
One of its standout features is the foreach pattern, which is particularly useful for repetitive tasks such as ML model training and data backfilling; on a scheduler like Airflow, backfilling daily ingested source data would typically require separate job runs. Maestro also offers multiple domain-specific languages (DSLs) for defining workflows declaratively in YAML, a capability that would need to be custom-built on top of Airflow.
Since its open-source release in July, the project has already garnered 3,000 stars on GitHub. Netflix had previously covered the internals and use cases implemented with this engine, and their latest blog post provides a comprehensive overview of Maestro's features and supported workflow patterns.
Additionally, this blog post provides a thorough comparison between Airflow and Maestro, complete with practical examples and code snippets.
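For readers who want a feel for what a foreach-style backfill maps to on the Airflow side, here is a small, hypothetical sketch using Airflow's dynamic task mapping in recent Airflow 2.x releases; the partition list and processing step are placeholders, and Maestro's own YAML and Java DSLs express this differently.

```python
# Hypothetical sketch: a foreach-style daily backfill expressed with Airflow's
# dynamic task mapping. The partition range and processing logic are placeholders;
# this only illustrates the pattern the Maestro foreach feature addresses natively.
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def daily_backfill():
    @task
    def list_partitions() -> list[str]:
        # Assumed helper: enumerate the daily partitions to reprocess.
        start = datetime(2024, 7, 1)
        return [(start + timedelta(days=i)).strftime("%Y-%m-%d") for i in range(31)]

    @task
    def reprocess(partition: str) -> None:
        # Placeholder for the real ingestion/transformation step.
        print(f"backfilling partition {partition}")

    # One mapped task instance per partition, similar in spirit to a foreach loop.
    reprocess.expand(partition=list_partitions())


daily_backfill()
```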
💡 Trends & Insight
👉 The State of the Modern Data Stack in 2024
In 2024, there has been much discussion about the decline of the Modern Data Stack (MDS). Concerns have been raised about its economics, and the term itself is going the way of previously hyped buzzwords such as "Big Data". Some experts believe that many MDS startups are doomed to extinction. As Matt Turck pointed out, the Modern Data Stack was largely a marketing concept and an alliance among several startups across the data value chain. In a recent blog post, Ananth explores its history, its decline, and the emerging post-MDS era. —> Read more
👉 Embracing 'Bring Your Own Compute' with DuckDB
An interesting discussion took place between the co-founders and CEOs of MotherDuck and Fivetran about the future of big data and single-node, laptop-sized analytics using DuckDB. With advancements in hardware, we might witness a new shift towards local execution, where computing is fully or partially pushed to the user's machine, introducing the concept of Bring Your Own Compute for analytics. —> Watch the interview
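As a rough illustration of what local, "bring your own compute" analytics looks like in practice, here is a minimal sketch using DuckDB's Python API; the file path and column names are placeholders.

```python
# Minimal local-analytics sketch using DuckDB's Python API.
# The parquet path and column names are hypothetical placeholders.
import duckdb

con = duckdb.connect()  # in-process, in-memory database; no server involved

result = con.execute(
    """
    SELECT event_date, count(*) AS events
    FROM read_parquet('events/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
    """
).fetchdf()

print(result.head())
```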
📡 Open Source News
👉 Apache Kafka 3.8 Release
Apache Kafka 3.8 has been released. The Confluent blog provides a summary of the new features and improvements in this release. —> Read more
👉 Delta Lake 4.0 New Features
Delta Lake 4.0 Preview was announced in June. A blog post highlights the release's important new features, such as Change Data Feed (CDF), Liquid Clustering, and new monitoring capabilities, with some practical examples. —> Read more
👉 DAG Factory Project Takeover by Astronomer
Astronomer has announced that it is taking over the open-source project DAG Factory, a Python library for authoring Airflow DAGs declaratively using YAML configuration files. Providing a thin no-code abstraction layer on top of Airflow has become a common practice among tech companies to reduce the engineering effort required to create DAGs, standardise pipeline creation for common use cases like data transformation, and make the process more self-service. —> Read more
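For a sense of what this declarative approach looks like, here is a hedged sketch based on the usage pattern documented in the DAG Factory README; the exact module names, YAML keys, and paths are assumptions and may vary between versions.

```python
# Hedged sketch of DAG Factory usage, based on the project's documented pattern;
# exact class names and YAML keys may differ between versions.
#
# Example YAML (e.g. dags/config_file.yml), shown here as a comment:
#
#   example_dag:
#     default_args:
#       owner: "data-eng"
#       start_date: 2024-01-01
#     schedule_interval: "@daily"
#     tasks:
#       extract:
#         operator: airflow.operators.bash.BashOperator
#         bash_command: "echo extract"
#       transform:
#         operator: airflow.operators.bash.BashOperator
#         bash_command: "echo transform"
#         dependencies: [extract]
#
# Loader script placed in the Airflow DAGs folder:
import dagfactory

dag_factory = dagfactory.DagFactory("/usr/local/airflow/dags/config_file.yml")
dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())
```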
🛠 Practical Data Engineering
👉 Consistent Data Modeling and Naming Conventions
Implementing a consistent data modeling framework, such as standardised naming conventions, is crucial to maintaining a healthy data platform and ensuring long-term scalability, even when there is turnover among engineers. Mike discusses some of the key aspects and best practices for data warehouse modeling, including effective naming conventions for tables and schemas. —> Read more
👉 State of CI/CD for Data Pipelines
lakeFS published a comprehensive overview of implementing Continuous Integration/Continuous Delivery (CI/CD) for data pipelines, focusing on the Write-Audit-Publish (WAP) ingestion pattern. The article explores various options and tools available in the market, offering insights into how to effectively integrate CI/CD practices into data workflows. —> Read more
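As a conceptual illustration of Write-Audit-Publish, the sketch below uses DuckDB as a stand-in warehouse; the table names, sample rows, and audit rules are invented for the example, and real implementations typically rely on branches (as in lakeFS) or staged snapshots in table formats like Iceberg or Delta.

```python
# Conceptual Write-Audit-Publish (WAP) sketch using DuckDB as a stand-in warehouse.
# Table names, sample rows, and audit rules are hypothetical.
import duckdb

con = duckdb.connect("warehouse.duckdb")

# 1. Write: land the new batch into a staging table, not the production table.
#    (In practice this would be loaded from the incoming batch, not inline VALUES.)
con.execute(
    """
    CREATE OR REPLACE TABLE orders_staging AS
    SELECT * FROM (VALUES (1, 'shipped'), (2, 'pending')) AS t(order_id, status)
    """
)

# 2. Audit: run data quality checks against the staged data only.
nulls = con.execute("SELECT count(*) FROM orders_staging WHERE order_id IS NULL").fetchone()[0]
rows = con.execute("SELECT count(*) FROM orders_staging").fetchone()[0]
if nulls > 0 or rows == 0:
    raise ValueError(f"audit failed: {rows} rows, {nulls} null order_ids")

# 3. Publish: only audited data is promoted to the production table.
con.execute("BEGIN TRANSACTION")
con.execute("CREATE OR REPLACE TABLE orders AS SELECT * FROM orders_staging")
con.execute("COMMIT")
```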
👉 Data Reconciliation Techniques and Best Practices
Datafold has published a three-part series on data reconciliation, a crucial subset of data quality. The series covers use cases, techniques, challenges, and best practices for performing data reconciliation across data sources and targets, with the goal of ensuring data accuracy and completeness. —> Read more
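One of the simpler techniques covered in such guides is comparing aggregates (row counts, column sums, content hashes) between source and target. The following is a minimal, self-contained sketch of that idea using made-up in-memory samples rather than real source and target systems.

```python
# Minimal sketch of aggregate-based reconciliation between a source and a target.
# In practice the aggregates would come from queries against two systems;
# here they are computed from in-memory samples to keep the example runnable.
from hashlib import sha256

source_rows = [(1, "alice", 120.0), (2, "bob", 75.5), (3, "carol", 33.2)]
target_rows = [(1, "alice", 120.0), (2, "bob", 75.5), (3, "carol", 33.2)]


def profile(rows):
    # Row count, sum of a numeric column, and an order-insensitive content hash.
    row_hash = sha256()
    for row in sorted(rows):
        row_hash.update(repr(row).encode())
    return {
        "row_count": len(rows),
        "amount_sum": round(sum(r[2] for r in rows), 2),
        "content_hash": row_hash.hexdigest(),
    }


src, tgt = profile(source_rows), profile(target_rows)
mismatches = {k: (src[k], tgt[k]) for k in src if src[k] != tgt[k]}
print("reconciled" if not mismatches else f"mismatches: {mismatches}")
```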
👉 dbt Beyond the Marketing Hype
There have been many blog posts and discussions about the hype dbt has generated over the past few years. The author of this blog post takes a different approach, discussing the challenges of performing data transformation in data warehouses and how dbt can address them. It starts with real problems and then explores how tooling, specifically dbt, provides solutions—rather than starting with the tool (because it's popular and everyone is talking about it) and then searching for problems it can solve. The post also offers a clear definition of what dbt actually does:
dbt works by abstracting common data-warehouse patterns into config-driven automation and providing a suite of tools to simplify SQL transformations, tests, and documentation.
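To make that definition concrete, here is a toy, purely illustrative sketch of the underlying idea: a model is a SQL template, and a ref() indirection is resolved from project configuration at compile time. This is not dbt's implementation or API, only a sketch of the concept with invented names.

```python
# Toy illustration of the idea behind config-driven SQL compilation:
# a model is a SQL template, and ref() is resolved from project configuration.
# This is NOT dbt's implementation or API, only a conceptual sketch.

MODEL_SQL = "SELECT customer_id, sum(amount) AS total FROM {{ ref('stg_orders') }} GROUP BY 1"

# Hypothetical project config mapping model names to physical relations.
PROJECT = {"stg_orders": "analytics.staging.stg_orders"}


def compile_model(sql: str, project: dict[str, str]) -> str:
    # Resolve each {{ ref('...') }} placeholder to its configured relation name.
    compiled = sql
    for model_name, relation in project.items():
        compiled = compiled.replace("{{ ref('" + model_name + "') }}", relation)
    return compiled


print(compile_model(MODEL_SQL, PROJECT))
# -> SELECT customer_id, sum(amount) AS total FROM analytics.staging.stg_orders GROUP BY 1
```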
⚙️ Technical Deep Dive
👉 Evolution of Debezium's Internal Engine
In many streaming CDC architectures, Debezium connectors are primarily used within the Kafka Connect framework and runtime, but it is also possible to run them outside the Kafka ecosystem, either by embedding the Debezium engine in your own applications or by using the standalone Debezium Server, which is now a separate project on GitHub. This Debezium blog post discusses the evolution of Debezium's internal engine, starting with the initial EmbeddedEngine implementation, which was mainly built for testing, and moving to the new AsyncEmbeddedEngine, which addresses the shortcomings of the previous implementation. —> Read more
👉 A Guide to Concurrency Levels in Apache Airflow
One of the most confusing aspects of the Apache Airflow engine, especially for newcomers, is how concurrency is applied at different levels, such as the scheduler, the DAG, and the task, and how their combination can affect overall workflow performance. The configuration parameters in earlier releases added to this confusion, with names that did not clearly reflect their scope. In a recent blog post, Google provides a comprehensive overview of the various concurrency levels in Airflow, with a particular focus on its managed Airflow service. —> Read more
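As a quick orientation, the sketch below shows where the main knobs live in recent Airflow 2.x releases: installation-wide settings in airflow.cfg, DAG-level arguments, and task-level arguments. The values are arbitrary and parameter names can differ slightly between versions.

```python
# Illustrative sketch of Airflow's main concurrency knobs (Airflow 2.x names);
# the values are arbitrary and exact parameters vary slightly by version.
# Installation-wide limits live in airflow.cfg, for example:
#   [core] parallelism = 32              # max running task instances per scheduler
#   [core] max_active_runs_per_dag = 16  # default concurrent runs per DAG
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="concurrency_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    max_active_runs=2,   # DAG level: concurrent runs of this DAG
    max_active_tasks=8,  # DAG level: concurrent task instances across its runs
) as dag:
    BashOperator(
        task_id="heavy_task",
        bash_command="echo processing",
        max_active_tis_per_dag=4,  # task level: concurrent instances of this task
        pool="default_pool",       # pools cap concurrency across DAGs
    )
```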
👉 Apache Airflow Software Architecture
A helpful guide posted on the Apache Airflow blog, with visuals that illustrate the key underlying components of the Apache Airflow software architecture and how they interact within the system. —> Read more
👉 Bringing GenAI and LLMs to Flink Streaming Pipelines
The Confluent blog provides an overview of a new Flink AI feature that allows streaming data pipelines to invoke AI models, including generative AI (GenAI) large language model (LLM) endpoints such as OpenAI and Google Vertex AI, directly from Flink SQL statements. This enables tasks like model inference, regression, and classification to be integrated seamlessly into real-time data processing workflows. —> Read more
🔎 Case Studies
👉 Pinterest's Migration from HBase to TiDB
Pinterest shared their journey of replacing HBase storage with a modern, scalable open-source database system that meets their requirements for reliability, performance, tunable consistency, and robust CDC support. They ultimately chose TiDB as the solution. We are seeing more stories of companies exploring alternatives to HBase due to its limitations and maintenance overhead. —> Read more
👉 Slack’s Migration to EMR 6
Slack discusses their migration from EMR 5 with Spark 2 to EMR 6 with Hive 3 and Spark 3 on AWS, highlighting the performance and reliability improvements achieved in their data pipelines, which are developed using Apache Spark and scheduled on Airflow. —> Read more
💬 Community Discussions
There was a recent discussion on Reddit about how fast data engineering is progressing. The consensus among most commenters is that while tools, storage systems, and processing frameworks may evolve rapidly, the fundamentals of data engineering remain consistent. These fundamentals include data integration, data modeling, and the processes of extracting, transforming, and loading data (ETL).
For aspiring data engineers, the key takeaway is to invest time and effort in mastering these basics and fundamentals rather than focusing solely on becoming an expert in specific tools. While vendors are striving to automate the data engineering lifecycle as much as possible (as seen with the latest Databricks offerings), a strong understanding of the fundamentals will always be valuable and help you stand out in the field.
📣 Vendors News & Announcements
👉 Snowflake's New Cortex Search Feature
Snowflake announced a new feature called Cortex Search (currently in Public Preview) in July 2024. This search service is designed for unstructured data, such as text, and enables enterprises to deploy Retrieval-Augmented Generation (RAG) applications using Snowflake, allowing them to customise generative AI applications with proprietary data. —> Read more
👉 Databricks Mosaic AI Model Training
Around the same time, Databricks announced support for Mosaic AI Model Training, which streamlines the fine-tuning of general-purpose open-source LLM and GenAI models, such as Llama 3 and Mistral, using enterprise data. Databricks recommends a new approach for training LLM models with enterprise data called Retrieval Augmented Fine-tuning (RAFT), which combines both Retrieval-Augmented Generation (RAG) and model fine-tuning. —> Read more
👉 Release of Confluent Platform 7.7
Confluent announced the release of Confluent Platform 7.7, built on Apache Kafka 3.7. This update introduces significant features, including Confluent Platform for Apache Flink, a Confluent-supported Flink distribution for self-managed environments (currently in Limited Availability), as well as a self-managed HTTP Source connector for ingesting data from external APIs. —> Read more