DLD #4 | Data Landscape Digest 🗞️

Rise of single-node engines, Postgres + DuckDB, Airflow alerting techniques, BigQuery continuous queries, top data engineering books, and more

Nov 03, 2024

✨ Featured - The Rise of Single-Node Engines

In a recent blog post, Jordan Tigani, MotherDuck co-founder and former tech lead at Google BigQuery, highlighted that most companies don't actually deal with "Big Data." An analysis of half a billion sample queries run on Amazon Redshift revealed that over 80% of the queries processed less than 1 TB of data.

Even among the small percentage of businesses that do handle Big Data, the majority of queries (95%) are executed on smaller tables. In cases where companies have large datasets, with some tables exceeding 10 TB, about 96% of the queries still target smaller, likely aggregated tables of 100 GB or less, rather than the actual large tables.

A published paper by AWS also notes that most tables contain fewer than a million rows, with the vast majority (98%) having fewer than a billion rows.

These findings, combined with advancements in software and hardware technology that have significantly enhanced the processing capabilities of single-node systems, suggest that powerful emerging single-node compute engines like DuckDB could increasingly handle many non-big data use-cases. This would reduce the need for distributed processing frameworks like Spark, provided that the maturity and ecosystem support of these engines continue to evolve.

In a recent LinkedIn post, I shared a graph with some of these points, which sparked an engaging discussion among the data engineering community. The conversation featured a mix of opinions on the future of computing, and it's worth checking out.

💡Opinion

👉 The Infamous Rise of Notebook Engineers!

Daniel Beach has penned an intriguing article critiquing the growing use of notebooks in data engineering. He argues that many engineers misuse notebooks due to either a lack of technical skills or encouragement from vendors, like Databricks, promoting questionable practices. While notebooks are valuable for data analysts and scientists engaged in iterative data analysis, Daniel contends they are ill-suited for robust data engineering lifecycles. This misuse leads to poor coding standards, insufficient testing, and inadequate deployment practices. —> Read More

👉 The Analytics Personas in Business

Tristan Handy, the founder and CEO of dbt Labs, wrote an insightful piece about the key analytics personas in business. He critiques the common approach of treating the analytical process like an assembly line, which often fails to deliver significant insights and ROI. Handy suggests that the personas—primarily engineers, analysts, and decision-makers—should be seen as interchangeable "hats" that team members can wear when needed, while still maintaining their primary roles. This flexible approach fosters collaboration and enhances the overall effectiveness of the analytics process. —> Read More

📡 Open Source News

👉 Apache Airflow 2.10 Release

Apache Airflow 2.10 was released recently, introducing exciting new features such as support for multiple executors within a single Airflow environment. This allows users to assign different executors, like LocalExecutor and CeleryExecutor, to individual DAGs and even specific tasks. There are also numerous enhancements to Datasets and the UI, which are worth exploring. —> Read More

👉 Development of a New PostgreSQL DuckDB Extension

MotherDuck announced pg_duckdb, an open-source Postgres extension that embeds the DuckDB engine into the Postgres database for running analytical queries on Postgres data. This is a significant step towards easily transforming a popular OLTP system into an HTAP system using an embedded OLAP engine. The development looks promising, with multiple companies such as Microsoft, Neon, and Hydra joining the effort. The beta version has been released recently. —> Read More

👉 Ibis's Default Backend Change

The Ibis DataFrame library project announced that it will drop the Pandas and Dask backends in favour of making DuckDB its default backend. This decision is due to DuckDB's ease of installation, impressive speed, and strong support within the Python ecosystem. —> Read More

🛠 Practical Data Engineering

👉 Apache Airflow Alerting Techniques

The Google Data Analytics blog has provided a comprehensive overview of the alerting hierarchy in Apache Airflow, ranging from the top DAG level down to the individual task instance level. It details various alerting mechanisms that can be used to monitor the state of DAG runs and receive notifications about potential failures. The alerting techniques discussed are applicable not only to Google's Cloud Composer managed Airflow service but also to other Airflow deployments. —> Read More
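As a rough illustration of the task-level end of that hierarchy, here is a minimal sketch of a failure callback. Airflow passes a context dictionary to functions registered through `on_failure_callback`; the function names, the message format, and the fake context at the bottom are assumptions made for this example and are not taken from the Google article.

```python
from types import SimpleNamespace


def build_failure_message(context: dict) -> str:
    """Format a short alert from an Airflow-style callback context.

    Only keys Airflow normally provides ("task_instance", "exception") are
    read, with fallbacks in case a key is missing.
    """
    ti = context.get("task_instance")
    dag_id = getattr(ti, "dag_id", "unknown-dag")
    task_id = getattr(ti, "task_id", "unknown-task")
    error = context.get("exception")
    return f"Airflow failure: dag={dag_id} task={task_id} error={error}"


def notify_on_failure(context: dict) -> None:
    """Hypothetical notifier: prints here; a real one might post to chat or email."""
    print(build_failure_message(context))


# In a real deployment the callback would be wired up roughly like this:
#   DAG(..., on_failure_callback=notify_on_failure)             # DAG-run level
#   default_args={"on_failure_callback": notify_on_failure}     # every task
# Below, a stand-in context is used so the sketch runs without Airflow.
fake_context = {
    "task_instance": SimpleNamespace(dag_id="sales_daily", task_id="load"),
    "exception": RuntimeError("boom"),
}
message = build_failure_message(fake_context)
notify_on_failure(fake_context)
```

Keeping the message formatting separate from the delivery step makes the alert text easy to unit test without a running scheduler.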

👉 Best Practices for Optimising Airflow

The AWS blog has covered comprehensive strategies for optimising cost and performance in its Apache Airflow managed service, Amazon MWAA. Right-sizing remains crucial for achieving a balanced price-performance ratio in managed services. Amazon MWAA also supports auto-scaling, which can aid in this optimisation. The blog offers additional useful techniques for optimising DAG code to ensure that DAGs remain healthy, efficient, and scalable. These techniques can be applied to any Airflow deployment setup. —> Read More


⚙️ Technical Deep Dive

👉 History and Evolution of Block Storage Services at AWS

Another fascinating story on the All Things Distributed blog explores the evolution of block storage services offered by Amazon Web Services (AWS). Written by one of AWS's leading engineers, it highlights key milestones in the development of Elastic Block Store (EBS), showcasing enhancements in performance, scalability, and continuous innovation. —> Read More

👉 The Internals of Apache Parquet

Vu has authored an insightful article on the internals of Parquet, the most popular cloud file format for data lakes. For those working with data lakes, understanding the design and architecture of common serialisation formats is invaluable for optimising storage and queries. —> Read More

👉 The Future of Distributed Systems and Their Storage Backend

Colin Breck has written a great article on the future of distributed systems. It highlights current challenges and major trends, such as the accelerating adoption of object stores as the main storage backend abstraction for many analytical and transactional database systems. —> Read More

💪 Skill Up

In a recent LinkedIn post, I shared my top book recommendations for learning data engineering fundamentals. The feedback was overwhelmingly positive, with many comments emphasising the importance of selecting a good book and focusing on mastering the fundamentals. I have personally read all these books in recent years and have gained a lot from them.

👉 DataCamp’s Free Week

DataCamp is offering free access to its entire platform and all courses for a week, from November 4 to 10. This is a great opportunity to explore their courses and enhance your skills in the coming week! —> Read More

🔎 Case Studies

👉 Cost-effective Data Analytics Using Deterministic Sampling

Meta has shared valuable insights into its approach for achieving cost-effective data analytics through the use of deterministic sampling. This strategy is designed to balance the cost versus value trade-off, especially as data volumes and computation costs continue to rise exponentially. By employing deterministic sampling, Meta aims to reduce the overall cost and complexity of analytics without compromising the quality of insights. —> Read More
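To make the idea concrete, here is a minimal sketch of one common way to sample deterministically: hash a stable key and keep the row when the hash falls below a threshold. This is a generic hash-based technique, not Meta's implementation; the function name, the salt, and the synthetic user IDs are assumptions for illustration.

```python
import hashlib


def in_sample(key: str, rate: float, salt: str = "demo-salt") -> bool:
    """Return True if `key` belongs to the deterministic sample.

    The decision depends only on (salt, key, rate), so every job that
    applies the same rule selects exactly the same keys.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the key when its bucket falls below the rate-scaled threshold.
    return bucket < int(rate * 2**64)


# Synthetic keys standing in for user or event identifiers.
user_ids = [f"user-{i}" for i in range(10_000)]
sample_10pct = [u for u in user_ids if in_sample(u, 0.10)]
sample_1pct = [u for u in user_ids if in_sample(u, 0.01)]
print(len(sample_10pct), len(sample_1pct))
```

Because the threshold grows with the rate, the 1% sample is always a subset of the 10% sample, and two tables sampled on the same key can still be joined without losing matches.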

👉 Uber's New Declarative Batch ETL Framework

Uber has developed a modular declarative batch ETL framework called Sparkle, which leverages Apache Spark as the compute engine. Sparkle simplifies and standardises ETL pipeline development by allowing users to focus on expressing business logic as a sequence of transformation modules in SQL or Java/Scala/Python. It includes embedded unit testing and marks Uber's transition of all its batch ETL pipelines from Hive to Spark in 2023. —> Read more

👉 Self-service Kafka Platform Development Journey

DoorDash has shared its journey in developing a self-service Kafka platform, aimed at addressing the challenges of managing Kafka infrastructure efficiently. This initiative was driven by the need to simplify the management of Kafka topics and resources, which was previously hindered by the use of low-level configuration management tools like Terraform. —> Read More

📣 Vendor News & Announcements

👉 Continuous Queries on Data Warehouse Systems

Google BigQuery has introduced a significant new feature called BigQuery continuous queries, currently available in Preview. This feature transforms BigQuery from a batch system into an event-driven streaming pipeline, leveraging the concept of Stream-Table Duality. It allows for the continuous ingestion of new events as data is loaded into BigQuery, enabling use cases like event-driven data processing, continuous record replication to a pub/sub queue or other streaming storage systems, real-time ML model integration, and Reverse ETL use cases. —> Read More

The Confluent blog has also published an article on leveraging this feature to stream data from BigQuery into the Confluent platform.

👉 New Google Managed Apache Kafka Service

Google has also announced a new managed service, Google Cloud Managed Service for Apache Kafka. This service abstracts the complexities of deploying and managing a Kafka cluster, offering features like security management, full management of brokers and storage, and automatic horizontal and vertical scaling. It also includes automatic storage tiering and lifecycle management to offload cold data to unlimited cloud storage. —> Read More

👉 Introduction of Conditional Writes on AWS S3

AWS recently announced the introduction of "Conditional Writes" on S3, marking a significant advancement in enhancing the reliability and efficiency of data operations, especially for distributed applications. This feature ensures that writes occur only if certain conditions are met, reducing the risk of unintentional data overwrites. It allows multiple clients to read and write to the same object without conflicts or concerns about overwriting each other's data. —> Read More
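The semantics can be sketched with a small in-memory stand-in for a bucket. As far as I understand the announcement, the condition is expressed with an `If-None-Match` header on the write, and a failed condition is reported as a precondition failure. The class and method names below are hypothetical and do not use the AWS SDK; the sketch only mimics the "create only if the key does not exist yet" behaviour.

```python
import threading


class PreconditionFailed(Exception):
    """Raised when a conditional write finds the key already present."""


class InMemoryBucket:
    """Toy stand-in for an object store that supports put-if-absent."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key: str, body: bytes) -> None:
        # The existence check and the write happen atomically under the lock,
        # which is the guarantee a conditional write provides.
        with self._lock:
            if key in self._objects:
                raise PreconditionFailed(key)
            self._objects[key] = body

    def get(self, key: str) -> bytes:
        return self._objects[key]


bucket = InMemoryBucket()
winners = []
for writer in ("writer-a", "writer-b"):
    try:
        # Both writers try to create the same commit marker.
        bucket.put_if_absent("commits/0001.json", writer.encode("utf-8"))
        winners.append(writer)
    except PreconditionFailed:
        # The loser learns that the object exists and can re-read or retry.
        pass

print(winners, bucket.get("commits/0001.json"))
```

Only the first writer succeeds, so clients can coordinate through the object store itself, for example to elect a leader or commit a log entry, without a separate locking service.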

🎥 Conferences & Events

👉 Thinking Like an Architect

Gregor Hohpe delivered an insightful presentation titled "Thinking Like an Architect" at QCon London 2024. If you're interested in or work with data architecture, his talk provides valuable insights worth exploring. —> Watch

👉 Carnegie Mellon University's Intro to Database Systems Course - Fall 2024

The Fall 2024 session of Carnegie Mellon University's renowned "Intro to Database Systems" course began in August. You can follow along with the course through recorded lectures available on their YouTube channel.
