Use Delta Lake ? if you’re in Spark/Databricks world

Apache Hudi

  • Created by: Uber.

  • Best for: Real-time data ingestion and incremental processing.

  • Key Features:

    • Supports streaming upserts & deletes (unique among the three).

    • Maintains two storage types:

      • Copy-on-Write (CoW) → batch-friendly, immutable.

      • Merge-on-Read (MoR) → streaming-friendly, allows near real-time updates.

    • Designed for low-latency ingestion pipelines.

👉 Example use: Keeping a user transactions table updated in near real-time for fraud detection.


⚖️ Quick Comparison

FeatureDelta LakeApache IcebergApache Hudi
OriginDatabricksNetflix (Apache)Uber (Apache)
Best ForBatch + StreamingLarge-scale analyticsReal-time ingestion
ACID Transactions
Time Travel
Upserts/DeletesGoodLimited (rewrite approach)Excellent (streaming-first)
Engine SupportSpark-heavySpark, Flink, Trino, etc.Spark, Flink, Hive, Presto