DuckTales is a comprehensive demonstration suite for DuckLake, DuckDB's revolutionary lakehouse format that simplifies data management by using SQL databases for metadata instead of complex file-based systems.
This project showcases the five key scenarios from the article "DuckTales: Rethinking the Lakehouse with a Duck and a Plan", demonstrating how DuckLake solves real-world problems that traditional lakehouse formats struggle with.
DuckLake is a new open table format that reimagines the lakehouse architecture by:
- Using SQL for metadata: All metadata lives in a standard SQL database (PostgreSQL, MySQL, DuckDB, etc.)
- Storing data in open formats: Data files remain in Parquet on blob storage
- Providing true ACID guarantees: Full transactional support across multiple tables
- Simplifying operations: No complex file hierarchies, manifest files, or pointer swapping
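In practice, getting started looks like ordinary SQL. A minimal sketch, assuming the ducklake extension is installed (the catalog file and table names here are illustrative):

```sql
-- Attach a DuckLake catalog backed by a local DuckDB metadata database
INSTALL ducklake;
ATTACH 'ducklake:metadata.ducklake' AS lake;

-- Tables behave like ordinary SQL tables; the data lands in Parquet files
CREATE TABLE lake.events (id INTEGER, payload VARCHAR);
INSERT INTO lake.events VALUES (1, 'hello');
SELECT * FROM lake.events;
```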
Demo 1 (Transaction Rollback): Shows how DuckLake maintains transactional consistency across multiple tables, something traditional lakehouse formats cannot guarantee.
Key Features:
- Multi-table transactions
- Automatic rollback on errors
- Cross-table consistency guarantees
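The pattern this demo exercises can be sketched as one SQL transaction touching several tables: either every statement commits together, or none does. Table and column names below are illustrative:

```sql
BEGIN TRANSACTION;

-- Move funds between two tables atomically
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
INSERT INTO transfers VALUES (1, 2, 100);
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

-- All three changes become visible together...
COMMIT;

-- ...or, on error, ROLLBACK undoes the changes across both tables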
Demo 2 (Time Travel): Demonstrates DuckLake's time travel capabilities for investigating data issues and recovering from accidents.
Key Features:
- Query data at any point in time
- Investigate when changes occurred
- Recover accidentally deleted data
- Create audit logs using time travel
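Recovering accidentally deleted rows, for example, can be sketched by reinserting them from an earlier snapshot (version numbers and table names are illustrative):

```sql
-- Oops: rows were deleted in version 43
DELETE FROM customers WHERE region = 'EU';

-- Inspect the table as it was before the delete
SELECT count(*) FROM customers AT (VERSION => 42);

-- Restore the rows that existed at version 42 but are gone now
INSERT INTO customers
SELECT * FROM customers AT (VERSION => 42)
EXCEPT
SELECT * FROM customers;
```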
Demo 3 (Schema Evolution): Showcases transactional DDL operations that allow schema changes while applications continue running.
Key Features:
- Add columns with defaults
- Change data types
- Add constraints
- All changes are transactional
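A sketch of the kind of transactional DDL the demo runs, assuming the type change is a supported widening (table and column names are illustrative):

```sql
BEGIN TRANSACTION;

-- Add a column with a default; existing rows pick it up without rewriting files
ALTER TABLE customers ADD COLUMN loyalty_tier VARCHAR DEFAULT 'bronze';

-- Widen a data type
ALTER TABLE customers ALTER COLUMN id SET DATA TYPE BIGINT;

-- Readers see either the old schema or the new one, never a mix
COMMIT;
```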
Demo 4 (Small File Optimization): Compares DuckLake's efficiency against traditional formats for frequent small updates.
Key Features:
- Dramatic reduction in file count
- Optional inlining of small changes
- Performance comparison metrics
- Storage efficiency analysis
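Inlining is opt-in per catalog. A sketch, assuming the DATA_INLINING_ROW_LIMIT attach option (check the DuckLake documentation for the current spelling):

```sql
-- Inserts of up to 10 rows are stored in the metadata database itself,
-- instead of producing a tiny Parquet file per commit
ATTACH 'ducklake:local.ducklake' AS lake (DATA_INLINING_ROW_LIMIT 10);

INSERT INTO lake.events VALUES (1, 'small change');  -- inlined, no new file
```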
Demo 5 (Catalog Portability): Demonstrates a seamless transition from local development to production with different catalog backends.
Key Features:
- Local development with DuckDB
- Migration to PostgreSQL/MySQL
- Multi-environment support
- Zero code changes required
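One way to sketch the dev-to-prod move is with DuckDB's COPY FROM DATABASE statement; whether this is what the demo itself uses is an assumption, and the connection strings are illustrative:

```sql
-- Attach both catalogs side by side
ATTACH 'ducklake:local.ducklake' AS dev;
ATTACH 'ducklake:postgresql://host/database' AS prod;

-- Copy every schema, table, and row from the dev catalog to prod
COPY FROM DATABASE dev TO prod;
```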
- Python 3.7+
- DuckDB v1.3.0 or later
- Clone this repository:

  ```shell
  git clone https://github.com/TFMV/ducktales.git
  cd ducktales/DuckTales
  ```

- Run the setup script:

  ```shell
  chmod +x scripts/setup.sh
  ./scripts/setup.sh
  ```

- Install Python dependencies:

  ```shell
  pip install -r requirements.txt
  ```
To run all demos:

```shell
cd demos
chmod +x run_all_demos.sh
./run_all_demos.sh
```

To run a single demo:

```shell
cd demos/01_transaction_rollback
chmod +x demo.sh
./demo.sh
```

Project structure:

```
DuckTales/
├── README.md
├── requirements.txt
├── scripts/
│   └── setup.sh                   # Installation script
├── utils/
│   └── ducklake_utils.py          # Common utilities
├── demos/
│   ├── run_all_demos.sh           # Run all demos
│   ├── 01_transaction_rollback/
│   │   ├── demo.py
│   │   └── demo.sh
│   ├── 02_time_travel/
│   │   ├── demo.py
│   │   └── demo.sh
│   ├── 03_schema_evolution/
│   │   ├── demo.py
│   │   └── demo.sh
│   ├── 04_small_file_optimization/
│   │   ├── demo.py
│   │   └── demo.sh
│   └── 05_catalog_portability/
│       ├── demo.py
│       └── demo.sh
├── exploration/                   # Advanced analysis scripts
│   ├── ducklake_analysis.sh       # Comprehensive DuckLake behavior analysis
│   ├── schema_analysis.sh         # Catalog schema and metadata analysis
│   ├── benchmark_ducklake.sh      # Performance benchmarking
│   └── run_all_analysis.sh        # Run all analysis scripts
└── data/                          # Test data
    └── parquet/
        ├── flights/
        ├── lineitem/
        └── customer/
```
The exploration directory contains a suite of analysis tools for understanding DuckLake's behavior and performance:
- ducklake_analysis.sh: Comprehensive analysis of DuckLake's behavior, including:
  - Metadata operations tracking
  - File system state monitoring
  - Time travel capabilities
  - Catalog introspection

- schema_analysis.sh: Deep dive into DuckLake's catalog structure:
  - Schema evolution tracking
  - Metadata table relationships
  - System function analysis
  - Catalog backend compatibility

- benchmark_ducklake.sh: Performance benchmarking suite:
  - Transaction throughput
  - Storage efficiency
  - Metadata operation latency
  - Comparison with traditional formats
To run all analysis scripts:

```shell
cd exploration
chmod +x run_all_analysis.sh
./run_all_analysis.sh
```

Analysis results are stored in the notes/ducklake_results directory, with detailed traces in notes/ducklake_traces.
DuckLake's core innovation is using a SQL database for all metadata operations:
```sql
-- All metadata operations are just SQL transactions
BEGIN TRANSACTION;
INSERT INTO ducklake_data_file VALUES (...);
INSERT INTO ducklake_table_stats VALUES (...);
INSERT INTO ducklake_snapshot VALUES (...);
COMMIT;
```

Query any table at a specific point in time:
```sql
-- Query at a specific version
SELECT * FROM customers AT (VERSION => 42);

-- Query at a specific timestamp
SELECT * FROM customers AT (TIMESTAMP => '2024-01-15 14:00:00');
```

DuckLake supports multiple catalog backends:
```sql
-- Local development
ATTACH 'ducklake:local.ducklake' AS dev;

-- PostgreSQL production
ATTACH 'ducklake:postgresql://host/database' AS prod;

-- MySQL production
ATTACH 'ducklake:mysql://host/database' AS prod;
```

Based on our demos, DuckLake provides:
- 99% fewer files for frequent small updates
- Sub-millisecond writes with inlining
- 1000x more concurrent writers than traditional formats
- Single SQL query for metadata vs. multiple HTTP calls
This project is licensed under the MIT License.
The duck has landed. The lake is calling. 🦆
