tidylake is an agnostic framework for managing data operations in your lakehouse using your favorite tools.
This project is under active development. It is production tested, but future releases will likely include breaking changes.
tidylake gives data teams a common ground between:
- Transformation code
- Metadata and contracts
- Operational workflows
It helps you manage the data product lifecycle without locking your project to a single engine, notebook style, or orchestrator.
The key advantages are:
- Framework-agnostic by design: keep using pandas, Spark, Iceberg, and your own stack.
- Metadata-first workflow: manifests act as the single source of truth for schema and semantics.
- Better collaboration across personas: analysts and engineers can work on the same assets with less friction.
- One codebase for interactive and batch work: iterate safely in notebooks and run the same logic in production.
- Built-in structure for automation: lineage discovery, CLI execution, and plugin-based extension points.
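The lineage idea in the last bullet can be illustrated without any tidylake API. The sketch below is not how tidylake implements lineage discovery. It assumes a hypothetical mapping from each data product to the products it reads, and uses Python's standard `graphlib` module to derive an order in which every product runs after its inputs. The product names other than `silver_customers` are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical lineage: each data product maps to the set of products it reads.
# These names are illustrative assumptions, not taken from a real project.
lineage = {
    "bronze_customers": set(),
    "silver_customers": {"bronze_customers"},
    "gold_customer_report": {"silver_customers"},
}


def run_order(graph: dict[str, set[str]]) -> list[str]:
    """Return an order in which every product comes after its inputs."""
    # TopologicalSorter treats the dict values as predecessors of each key.
    return list(TopologicalSorter(graph).static_order())


order = run_order(lineage)
print(order)
```

Because the dependencies are declared as data rather than buried in scripts, the same mapping could drive both a listing of products and a batch run.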
For full setup, concepts, and end-to-end examples, see the documentation.
The docs include complete runnable examples. This is a minimal sketch of what a tidylake data product looks like.
Create a manifest (silver_customers.yml):

data_product:
  name: silver_customers
  description: Customer profile from CRM
  script: silver_customers
  schema:
    properties:
      customer_id:
        type: string
      customer_name:
        type: string

Link it to a script (silver_customers.py):

import pandas as pd
from tidylake import get_or_create_context

# Look up the data product declared in silver_customers.yml
product = get_or_create_context().get_data_product("silver_customers")

@product.add_input()
def bronze_customers():
    return pd.read_parquet("/tmp/bronze_customers")

# Keep only the columns declared in the manifest schema
df = bronze_customers()[["customer_id", "customer_name"]]

@product.set_sink()
def write_silver_customers():
    df.to_parquet("/tmp/silver_customers", index=False)

Then use the CLI:

tidylake list
tidylake run

You can extend tidylake with plugins to integrate storage, compute, and catalog services from your existing stack.
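As a rough illustration of the metadata-first idea, the schema properties in the manifest above can be treated as a contract that a script's output columns are checked against. The sketch below is an assumption about what such a check might look like, not tidylake's own validation logic. `contract_violations` is a hypothetical helper, and the schema is written as a plain dict so the example needs no YAML parser.

```python
# Schema properties copied from the silver_customers manifest,
# written as a plain dict so no YAML parser is needed.
schema_properties = {
    "customer_id": {"type": "string"},
    "customer_name": {"type": "string"},
}


def contract_violations(properties: dict, columns: list[str]) -> dict[str, list[str]]:
    """Compare produced column names with the properties declared in the manifest."""
    declared = set(properties)
    produced = set(columns)
    return {
        "missing": sorted(declared - produced),     # declared but not produced
        "undeclared": sorted(produced - declared),  # produced but not declared
    }


# Hypothetical output columns of a script, e.g. list(df.columns) with pandas.
result = contract_violations(schema_properties, ["customer_id", "signup_date"])
print(result)
```

This check only compares column names. Type checks would need a mapping from manifest types to the types of the engine in use, which is left out here.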
See CONTRIBUTING.md for development setup and contribution guidelines.
This project is open source, released under the Apache License, and brought to you by the Taidy team.