State of Open Source Real-Time OLAP Systems 2025
Overview of Major 2024 Trends and Emerging Technologies Shaping 2025
This is the fourth part in the Data Landscape Trends 2024-2025 series, focusing on the state of open-source real-time OLAP database systems.
In the first part, we explored the evolution of the BI stack; the second part examined the rise of single-node processing engines; and the third part discussed the evolution of zero-disk architecture.
Introduction
Real-time OLAP database systems have evolved significantly since their early development in the 2010s by companies like Yandex, Metamarkets and LinkedIn, which introduced systems such as ClickHouse, Druid, and Pinot to address the limitations of traditional MPP engines.
Initially designed for sub-second analytics on massive volumes of append-only web logs and clickstream data, these specialised databases have expanded their capabilities over the years.
New entrants like Apache Doris and StarRocks have joined the scene, aiming to bridge the gap between traditional OLAP architectures and modern MPP systems.
This article provides a comprehensive overview of the evolving real-time OLAP ecosystem, examining the current state and advancements of major open-source OLAP engines. We'll explore:
Background on real-time OLAP database systems
An analysis of the current landscape and leading open-source products
Emerging trends in 2024
Major features and capabilities introduced by each product in 2024
An assessment of key open source metrics including popularity, development activity, and community engagement
A comprehensive comparison of architectural approaches and core features
Recommendations for choosing and implementing these systems
Real-time OLAP Database Systems
Real-time OLAP engines are specialised databases designed to deliver sub-second analytics performance.
They operate similarly to cube servers in traditional BI solutions, pre-computing metrics across various dimensions to enable real-time drilling and slice-and-dice analysis of data.
These engines achieve exceptional query performance through a combination of optimised storage and sophisticated indexing during data ingestion. Their architecture treats data as immutable, optimised primarily for append-only operations and segment-level (i.e. data chunks) replacements.
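The pre-computation idea described above can be sketched in a few lines of Python. This is a toy illustration rather than how any of these engines is implemented: the `rollup` function, the field names (`ts`, `country`, `device`, `revenue`) and the sample events are all assumptions made up for the example. It collapses raw events into one pre-aggregated row per time bucket and dimension combination, which is roughly what ingestion-time roll-up produces before a segment is sealed.

```python
from collections import defaultdict


def rollup(events, dimensions, granularity_seconds=3600):
    """Pre-aggregate raw events into one row per (time bucket, dimension values)."""
    totals = defaultdict(lambda: {"count": 0, "revenue": 0.0})
    for event in events:
        # Truncate the event timestamp to the start of its time bucket.
        bucket = event["ts"] - event["ts"] % granularity_seconds
        key = (bucket,) + tuple(event[d] for d in dimensions)
        totals[key]["count"] += 1
        totals[key]["revenue"] += event["revenue"]
    # Return a plain dict: once built, the "segment" is treated as read-only.
    return dict(totals)


# Hypothetical sample events (epoch seconds), all falling in the same hour.
events = [
    {"ts": 1_700_000_000, "country": "NZ", "device": "mobile", "revenue": 2.0},
    {"ts": 1_700_000_100, "country": "NZ", "device": "mobile", "revenue": 3.0},
    {"ts": 1_700_000_200, "country": "AU", "device": "desktop", "revenue": 5.0},
]

segment = rollup(events, dimensions=("country", "device"))
for key, metrics in sorted(segment.items()):
    print(key, metrics)
```

Three raw events become two stored rows here. Queries that slice by country or device then read the pre-aggregated rows instead of scanning raw events, at the cost of losing per-event detail.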
Trade-offs
While this storage paradigm is particularly well-suited for log and event workloads in denormalised format, it comes with certain trade-offs: inflexible data models, limited join capabilities, and higher ingestion latency.
Unlike modern MPP-based scalable storage systems such as Redshift, BigQuery, and Snowflake—which couldn't deliver sub-second queries due to architectural limitations—these systems prioritised query performance over conventional database features like table joins, ACID guarantees, and row-level mutations.
In recent years, ClickHouse and newer systems like Apache Doris and StarRocks have been moving beyond the traditional real-time OLAP storage model by introducing support for mutable data operations and complex queries typically associated with data warehouse systems.
Current Landscape and Major Products
The following graph illustrates the development timeline of major open-source real-time OLAP engines, highlighting when each system was open-sourced and, where applicable, donated to the Apache Software Foundation.
The real-time OLAP landscape features several established products, each with its own origin story. ClickHouse was initiated at Yandex in 2010, followed by Apache Druid (developed by Metamarkets in 2011), Apache Kylin (created by eBay), and Apache Pinot (originated at LinkedIn in 2013).
More recently, the ecosystem expanded with two significant additions: Apache Doris and StarRocks. Baidu developed and open-sourced Apache Doris in 2017-2018.
StarRocks later emerged as a fork of Doris, led by former Apache Doris PMC members who sought to address what they perceived as gaps in the original project's roadmap for building modern real-time analytics systems.
Divergence From Traditional OLAP Model
While Apache Doris and StarRocks are classified as real-time OLAP systems, they represent a divergence from traditional OLAP approaches.
Rather than relying on immutable storage models and heavy pre-processing and indexing methods with cube semantics, these systems bridge the gap between conventional OLAP designs and modern MPP (Massively Parallel Processing) architectures like Amazon Redshift and Google BigQuery.
Through native support for both bulk and row-level updates, alongside complex join capabilities, these systems are aiming to offer a hybrid solution that combines real-time OLAP performance with the versatility of general-purpose analytics platforms.
The ecosystem also includes proprietary solutions such as Rockset and Kinetica, though these fall outside the scope of this open-source focused analysis.
GitHub Repository Trends
Open-source projects are often evaluated based on key metrics such as repository stars, download counts, contributor activity, and repository engagement, including commits, releases and issues logged and resolved.
As part of my work and passion for the open-source ecosystem, I run my own little analytics platform to collect, store, and analyse GitHub events to track year-over-year trends in data engineering-related projects.
Project Popularity
By using GitHub's Watch (Star) and Fork events as indicators of community interest, my analysis of 2024 data reveals that ClickHouse stands out as the clear leader.
With approximately 6,400 new stars and a significantly higher number of forks than other projects, ClickHouse outperformed its competitors, more than doubling the attraction of Apache Doris, which ranks second.
The number of repository forks in 2024 follows a similar trend:
Apache Doris and StarRocks are having a close race to capture the attention of the industry and gain adoption as leading hybrid real-time OLAP and Data Warehouse engines.
In contrast, Apache Pinot and Apache Druid are struggling to keep pace with the likes of ClickHouse and Doris/StarRocks. The lack of significant innovations from these projects in 2024 might hinder their ability to further capture market interest.
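For readers who want to reproduce this kind of popularity count, here is a minimal sketch. It is not the pipeline behind the numbers above, which is not described in this article. It assumes newline-delimited JSON records shaped like public GitHub events (`type`, `repo.name`, `created_at`), and the `count_interest` function, the repository names and the sample records are invented for illustration.

```python
import json
from collections import Counter


def count_interest(lines, year):
    """Count star (WatchEvent) and fork (ForkEvent) events per repository for one year."""
    stars, forks = Counter(), Counter()
    for line in lines:
        event = json.loads(line)
        # created_at is an ISO 8601 string such as "2024-03-01T12:00:00Z".
        if not event["created_at"].startswith(str(year)):
            continue
        repo = event["repo"]["name"]
        if event["type"] == "WatchEvent":
            stars[repo] += 1
        elif event["type"] == "ForkEvent":
            forks[repo] += 1
    return stars, forks


# Made-up records for two hypothetical repositories.
sample = [
    '{"type": "WatchEvent", "repo": {"name": "example/engine-a"}, "created_at": "2024-01-05T10:00:00Z"}',
    '{"type": "WatchEvent", "repo": {"name": "example/engine-a"}, "created_at": "2024-06-11T08:30:00Z"}',
    '{"type": "ForkEvent", "repo": {"name": "example/engine-a"}, "created_at": "2024-02-01T09:00:00Z"}',
    '{"type": "WatchEvent", "repo": {"name": "example/engine-b"}, "created_at": "2024-03-20T14:00:00Z"}',
    '{"type": "WatchEvent", "repo": {"name": "example/engine-b"}, "created_at": "2023-12-31T23:59:00Z"}',
    '{"type": "PushEvent", "repo": {"name": "example/engine-b"}, "created_at": "2024-04-02T16:00:00Z"}',
]

stars, forks = count_interest(sample, year=2024)
print(stars.most_common())
print(forks.most_common())
```

The same loop extends naturally to the pull request, push and issue events used in the following sections by adding further event types to the branch.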
Code Activity
Development activity in 2024 showed interesting patterns across projects. StarRocks and Apache Doris led in pull request activity, each processing approximately 30K pull requests (opened plus closed), while ClickHouse maintained a strong third position.
However, ClickHouse demonstrated the highest code commitment activity, showing the most consistent push frequency and merge operations. StarRocks and Apache Doris followed with high activity levels, ranking second and third respectively.
In contrast, 2024 saw lower levels of development activity for Druid, Pinot, and Kylin. While Druid and Pinot are mature products that naturally might require less frequent updates, their reduced code activity, along with Kylin's, still raises concerns about potential project stagnation.
This trend is particularly evident in Apache Kylin, which recorded only 150 pull requests and 85 code pushes for the entire year, indicating significantly diminished development momentum.
As a result, Apache Kylin will be excluded from the majority of this study moving forward.
User Engagement
Repository issue activity—both opened and resolved—serves as a critical indicator of project health and community engagement in open-source projects.
Analysis of 2024 data shows ClickHouse leading in this metric with the highest volume of user-reported issues and resolutions, with StarRocks and Doris showing strong activity levels in second and third positions respectively.
Project Collaboration
ClickHouse maintains the largest contributor base, with over 1,600 committers since inception, yet Apache Doris and StarRocks showed stronger community growth in 2024 in terms of attracting new collaborators.
The contribution metrics also reveal significant regional concentration, with StarRocks and Doris receiving substantial backing from major Chinese technology companies including Baidu, Tencent, and Alibaba. Over 80% of contributions to these repositories originate from China, reflecting strong regional investment in these platforms.
Adoption & Installations
Industry adoption and usage can be measured through download and installation metrics, with Docker Hub downloads serving as a particularly reliable indicator for database systems.
In this metric, ClickHouse demonstrates exceptional market penetration with over 100 million downloads.
Apache Pinot and Apache Druid show substantial adoption with 10 million and 5 million downloads respectively. The newer entrants, StarRocks and Doris, have achieved encouraging early traction with 500K and 100K downloads respectively.
Major 2024 Trends
1. Adoption of Decoupled Storage and Compute Architecture
Distributed real-time OLAP systems have traditionally followed the shared-nothing architecture common to modern MPP-based storage systems.
However, the industry's shift towards decoupled storage and compute models has prompted real-time OLAP engines to embrace "zero-disk architecture".
This architecture leverages deep storage solutions like HDFS and S3 as the primary persistence layer, offering enhanced scalability and flexibility while reducing operational costs.
In 2024, StarRocks and Apache Doris incorporated this architectural approach into their platforms.
ClickHouse had earlier laid the groundwork for this transition with its S3-backed MergeTree tables in version 21.8 (August 2021), enabling direct table storage in Amazon S3 or compatible object storage, and has since expanded its cloud offerings around this model.
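To make the decoupled model more concrete, the sketch below simulates it in plain Python. It is a conceptual toy, not the design of any of the engines mentioned: `ObjectStore` stands in for a service such as S3, `ComputeNode` is a hypothetical stateless query node, and the LRU cache policy is simply an assumption chosen for brevity. The point it illustrates is that segments live only in deep storage, while local disk or memory on a compute node is a disposable cache.

```python
from collections import OrderedDict


class ObjectStore:
    """Stand-in for deep storage (e.g. S3 or HDFS): the only durable copy of the data."""

    def __init__(self):
        self._blobs = {}
        self.reads = 0  # number of remote reads, to show the effect of caching

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        self.reads += 1
        return self._blobs[key]


class ComputeNode:
    """Stateless query node with a small read-through LRU cache."""

    def __init__(self, store, cache_capacity=2):
        self.store = store
        self.capacity = cache_capacity
        self.cache = OrderedDict()

    def read_segment(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        data = self.store.get(key)  # cache miss: fetch from deep storage
        self.cache[key] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return data


store = ObjectStore()
store.put("seg-1", b"segment one")

node = ComputeNode(store)
node.read_segment("seg-1")
node.read_segment("seg-1")  # served from the local cache
print(store.reads)

replacement = ComputeNode(store)  # a new node starts empty but loses no data
replacement.read_segment("seg-1")
print(store.reads)
```

Because a node holds nothing that cannot be re-fetched, compute can be scaled up, scaled down or replaced independently of the stored data, which is the operational appeal of this architecture.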
2. Federated Analytics
The rising adoption of data lake and lakehouse architectures has prompted major analytical storage systems to pursue seamless integration with open table formats including Hudi, Iceberg, and Delta Lake.
Beyond these table formats, the platforms are extending their capabilities to support both read and write operations for industry-standard data lake file formats such as Parquet and ORC.
Modern real-time OLAP engines have embraced this trend, evolving beyond their traditional roles into unified analytics platforms.
ClickHouse, Apache Doris, and StarRocks have implemented native federation capabilities, enabling direct querying across diverse data sources—including data warehouses, data lakes, and open table formats—without the traditional requirements of data ingestion or replication.
3. Real-Time Data Warehouse
A significant trend in the ecosystem is the drive towards delivering "real-time data warehouse" capabilities.
This advancement enables ad-hoc analytical queries without pre-aggregation, supporting complex joins across multiple datasets—a functionality that has traditionally challenged real-time OLAP storage models.
Doris and StarRocks are leading this transformation by combining the strengths of flexible MPP engines (exemplified by Redshift, Snowflake, and BigQuery) with the real-time analytical capabilities of OLAP systems. Their hybrid approach achieves a balance of speed, flexibility, and scalability.
ClickHouse has also embraced this direction, enhancing its core engine to support broader OLAP workloads through improved update operations and enhanced join capabilities.
Major Features and Improvements Introduced in 2024
Below is an analysis of major features and enhancements introduced by each platform in 2024:
ClickHouse
Refreshable Materialised Views: Enables periodic on-demand full recomputation and refresh of materialised views.
Remote File Caching: Enhances efficiency for distributed workloads by caching remote files, significantly reducing access times.
JSON Data Type: Introduced in ClickHouse v24.8, this new data type improves handling of semi-structured data.
StarRocks
Shared-Data Clusters: Introduced in version 3.3, providing a decoupled storage and compute architecture for zero-disk capability using deep storage as the primary persistence layer.
Pipe Service: Automates continuous loading of data files (e.g., Parquet) from deep storage services like S3 and HDFS into the StarRocks engine.
Unified Catalog: Enables query federation over lakehouse tables, supporting direct queries on open table formats, with integration with Hive Metastore or AWS Glue for data discovery.
Apache Doris
Decoupled Storage and Compute Architecture: Introduced in Doris 3.0.3 (December 2024), supporting S3-compatible object storage for cloud-native data persistence.
Data Write-Back: Enables DDL and DML functions such as creating tables and writing data to Hive and Iceberg tables directly through Doris.
Transaction Support: Adds CRUD operations like INSERT INTO SELECT, DELETE, and UPDATE.
Materialised Views: Introduced asynchronous and multi-table materialised views.
Semi-Structured Data Support: Enhanced with the new VARIANT data type.
Expanded Data Lake Integration capabilities.
Apache Druid
Version 30.0 Enhancements: Better ingestion experiences for Amazon Kinesis, Apache Kafka, Delta Lake, and improved integrations with Google Cloud Storage and Azure Blob Storage.
Centralised Schema Management: Speeds up schema operations and cluster startup by gathering segment metadata.
Version 31.0 (Dart Query Engine): Introduced a new query engine supporting complex workloads like large joins and high-cardinality GROUP BY, expanding Druid's capabilities into MPP (Massively Parallel Processing) territory.
Apache Pinot
Multi-Stage Query Engine: Enhanced for better performance and scalability.
Upsert and Compaction Improvements: Optimised for data ingestion workflows.
Semi-Structured Data Support: Improved handling of JSON data.
Delta Lake Integration: Added support for the Delta Kernel library, enabling integration with Delta Lake.
Major Vendor announcements and features
The following table lists key developments across SaaS vendors supporting these open-source platforms, including major announcements, strategic partnerships, and product expansions:
Features Comparison
The real-time OLAP landscape has been extensively documented, with numerous comparisons of the top three contenders: ClickHouse, Druid, and Pinot.
Roman Leventov's comprehensive 2018 analysis stands as a notable reference point, detailing their architectural differences and capabilities. However, these platforms have evolved significantly in recent years.
The following sections provide a detailed comparison of these engines across key architectural and functional categories, reflecting their current capabilities and distinctions.
1. System Architecture
These platforms all build upon a distributed shared-nothing architecture. ClickHouse, Doris, and StarRocks have extended this model to support shared-storage (zero-disk) configurations through decoupled storage and compute capabilities.
From a software architecture perspective, ClickHouse, Doris, and StarRocks adopt a simpler architecture than the relatively complex implementations of Apache Druid and Apache Pinot.
Doris and StarRocks are particularly notable for their self-contained design, operating independently of external system dependencies.
2. Data Architecture
All products implement columnar storage systems, with Doris and StarRocks recently extending their capabilities to support row-based storage modes.
Druid and Pinot distinguish themselves through their time-series-oriented data model, which leverages timestamp columns as primary partition fields. These two engines also excel in supporting hybrid workloads (real-time and batch) with flexible segmentation granularity.
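The benefit of timestamp-based partitioning is easiest to see with segment pruning. The following sketch is illustrative only and does not reflect the internal data structures of Druid or Pinot: the `Segment` class, the `prune` helper and the daily granularity are assumptions. It shows how a query with a time filter only needs to touch segments whose interval overlaps the requested range.

```python
from dataclasses import dataclass

DAY = 86_400  # seconds in a day; daily granularity is an assumption for this example


@dataclass(frozen=True)
class Segment:
    start: int  # inclusive, epoch seconds
    end: int    # exclusive, epoch seconds
    name: str


def prune(segments, query_start, query_end):
    """Keep only segments whose [start, end) interval overlaps the query interval."""
    return [s for s in segments if s.start < query_end and s.end > query_start]


# Three consecutive daily segments.
segments = [Segment(i * DAY, (i + 1) * DAY, f"day-{i}") for i in range(3)]

# A one-hour query window inside the second day.
selected = prune(segments, DAY + 3_600, DAY + 7_200)
print([s.name for s in selected])
```

Finer segment granularity means fewer rows scanned for narrow time filters but more segments to manage, which is the trade-off behind configurable segmentation.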
Regarding transactional capabilities, Doris and StarRocks lead the ecosystem in ACID compliance, followed closely by ClickHouse. These platforms provide good support for primary key uniqueness, atomicity and concurrency control.
3. Query and Materialised Views
Materialised view support is currently limited to ClickHouse, Doris, and StarRocks.
These engines offer both synchronous materialised views which automatically update when source table data changes, and asynchronous materialised views which can be recomputed on demand or through scheduled jobs.
Doris and StarRocks extend this functionality with advanced features including multi-table materialised views and automatic query rewriting capabilities.
They also demonstrate superior complex join support, while ClickHouse provides moderate join capabilities. Druid and Pinot, however, are restricted to joins with small, dedicated dimension tables.
In terms of query processing capabilities, ClickHouse, Doris, and StarRocks maintain their leadership through sophisticated optimisation techniques, including cost-based optimisation (CBO) and vectorised processing.
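The difference between synchronous and asynchronous materialised views described in this section can be illustrated with a small in-memory model. This is a conceptual sketch only: the `Table`, `SyncSumView` and `AsyncSumView` classes are hypothetical and do not correspond to the API of any of these engines. The synchronous view is updated as part of every insert, while the asynchronous view is recomputed from the base table only when `refresh` is called, for example by a scheduler.

```python
class Table:
    """Base table that notifies attached synchronous views on every insert."""

    def __init__(self):
        self.rows = []
        self._sync_views = []

    def attach_sync_view(self, view):
        self._sync_views.append(view)

    def insert(self, row):
        self.rows.append(row)
        for view in self._sync_views:
            view.on_insert(row)  # maintained incrementally at write time


class SyncSumView:
    """Synchronous view: SUM(value) GROUP BY key, always up to date."""

    def __init__(self):
        self.totals = {}

    def on_insert(self, row):
        self.totals[row["key"]] = self.totals.get(row["key"], 0) + row["value"]


class AsyncSumView:
    """Asynchronous view: fully recomputed on demand or on a schedule."""

    def __init__(self, table):
        self.table = table
        self.totals = {}

    def refresh(self):
        totals = {}
        for row in self.table.rows:
            totals[row["key"]] = totals.get(row["key"], 0) + row["value"]
        self.totals = totals


table = Table()
sync_view = SyncSumView()
table.attach_sync_view(sync_view)
async_view = AsyncSumView(table)

table.insert({"key": "a", "value": 1})
table.insert({"key": "a", "value": 2})

print(sync_view.totals)   # reflects both inserts immediately
print(async_view.totals)  # still empty until refreshed
async_view.refresh()
print(async_view.totals)
```

The synchronous form adds work to every write but never serves stale results. The asynchronous form keeps writes cheap and can express heavier queries, at the price of staleness between refreshes.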
4. Data Ingestion
Real-time OLAP engines support two fundamental data ingestion modes: batch ingestion for processing data from sources like data lakes, and streaming ingestion for handling continuous data flows from platforms such as Apache Kafka.
These engines primarily employ ETL (Extract, Transform, Load) for data loading, performing transformations during ingestion to align with their immutable data model.
This approach particularly suits Druid, Pinot, and ClickHouse, which are optimised for denormalised and pre-aggregated data. The alternative ELT (Extract, Load, Transform) approach, which prioritises raw data loading, sees limited adoption in these systems.
For mutable data operations, Pinot, ClickHouse, Doris, and StarRocks provide support for upserts and primary-key-level row deduplication.
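As a rough illustration of primary-key-level deduplication, the sketch below keeps only the latest version of each row. Each engine implements upserts differently, so this should be read as the general idea rather than any product's mechanism. The `upsert` function and the `id`, `version` and `status` fields are assumptions made up for the example.

```python
def upsert(table, row, key="id", version="version"):
    """Insert the row, or replace the existing row with the same key if it is not newer."""
    current = table.get(row[key])
    if current is None or row[version] >= current[version]:
        table[row[key]] = row
    return table


# Hypothetical change stream; the last record arrives late and out of order.
changes = [
    {"id": 1, "version": 1, "status": "created"},
    {"id": 2, "version": 1, "status": "created"},
    {"id": 1, "version": 3, "status": "shipped"},
    {"id": 1, "version": 2, "status": "paid"},
]

table = {}
for change in changes:
    upsert(table, change)

for row in table.values():
    print(row)
```

Comparing on a version column rather than on arrival order is what lets late or replayed records be ignored safely, which matters for streaming sources where ordering is not guaranteed.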
Using External Compute Frameworks
While offering native batch ingestion capabilities, these platforms integrate with external processing frameworks—including Hadoop (Druid, Pinot), Spark, and Apache Flink—to offload complex data transformation and computation during ingestion.
They also support push-based ingestion through frameworks like Kafka Connect and Flink CDC Connect, which enable data ingestion via custom-built connectors.
External storage support
All platforms effectively handle data lake file formats (CSV, ORC, Parquet), with ClickHouse, Doris, and StarRocks further extending support to major open table formats in data lakehouse architectures—a capability where Druid and Pinot currently lag.
In terms of log-based Change Data Capture (CDC) ingestion, ClickHouse and StarRocks offer comprehensive support, though ClickHouse's CDC solutions typically require subscription-based ingestion services.
5. Query Federation & Interfaces
As highlighted earlier, a key advancement in modern OLAP systems is their integration with external storage systems, enabling direct data querying without prior ingestion.
Among the five open-source platforms, ClickHouse, Apache Doris, and StarRocks lead with advanced query federation capabilities. These platforms support major external storage systems including data lakes, lakehouses, and DBMSs.
They enable external table definitions over files and directories in data lakes hosted on HDFS and S3-compatible platforms, and facilitate direct querying of open table formats like Hudi, Iceberg, and Delta Lake. Doris and StarRocks extend this support to include Apache Paimon.
Data Write-back
For data export and write-back operations, Doris and StarRocks support writing data to Hive and Iceberg, while ClickHouse mainly supports MySQL and Postgres.
Doris and StarRocks also enhance their data discovery capabilities through integration with external metadata services such as Hive Metastore and AWS Glue.
In terms of broader ecosystem integration, ClickHouse, Druid, and Pinot demonstrate comprehensive support. Major compute frameworks, including Spark, Presto, Trino, and Hive, have integrated with these engines, enabling direct data access.
All platforms provide moderate support for BI tool integration.
Recommendations
Each product in the current OLAP market offers unique advantages derived from its original design goals and subsequent feature developments.
While this general recommendation guide serves as a starting point, a comprehensive evaluation across multiple criteria is essential when comparing and selecting a product.
Small-to-medium Deployments:
Overall, ClickHouse is an excellent real-time OLAP engine suitable for small-to-medium environments. Its straightforward deployment, management, and architecture make it the preferred choice for general use cases.
Large On-Premise Deployments:
For large-scale implementations, particularly on Hadoop or similar platforms, ClickHouse, Pinot, and Druid are leading candidates. The final selection should align with specific workload requirements and use cases.
Cloud-Native Implementations:
Cloud-native deployments utilising object storage as the main persistence layer can leverage managed solutions like ClickHouse Cloud, or platforms such as StarRocks and Doris. However, StarRocks and Doris introduced their decoupled architecture only recently, so their production readiness should be evaluated carefully.
Log Analytics & Time-series Data:
Druid and Pinot demonstrate particular strength in processing immutable time-series data, including web logs, machine logs, and clickstream events. Their support for hybrid tables makes them ideal for Lambda-style architectures.
Unified Analytics with Query Federation:
ClickHouse, StarRocks, and Doris excel in unified analytics scenarios, offering query federation capabilities that enable seamless data access across diverse sources such as data lakes, lakehouses and DBMS systems.
Hybrid Data Warehouse-OLAP Solutions:
StarRocks and Doris provide a middle ground, combining traditional data warehouse capabilities with real-time OLAP performance. They offer comprehensive CRUD operations, complex join support (including star schema), and ACID guarantees to some extent.
Conclusion
The real-time OLAP ecosystem has evolved from specialised engines for append-only data processing into versatile analytical platforms. While ClickHouse maintains its leadership position in general-purpose deployments, newer platforms like StarRocks and Apache Doris are bridging the gap between real-time analytics and data warehouse capabilities.
The adoption of decoupled architectures and unified analytics suggests continuing evolution in this space. Organisations should evaluate their specific requirements across performance, scalability, and integration needs while considering platform maturity and community support when making their selection from these open source systems.