
tidylake



tidylake is an agnostic framework for managing data operations in your lakehouse using your favorite tools.

This project is under active development. It has been tested in production, but future releases will likely include breaking changes.

Purpose

tidylake gives data teams a common ground between:

  • Transformation code
  • Metadata and contracts
  • Operational workflows

It helps you manage the data product lifecycle without locking your project to a single engine, notebook style, or orchestrator.

Why Use It

The key advantages are:

  • Framework agnostic by design: keep using pandas, Spark, Iceberg, and your own stack.
  • Metadata-first workflow: manifests act as the single source of truth for schema and semantics.
  • Better collaboration across personas: analysts and engineers can work on the same assets with less friction.
  • One codebase for interactive and batch work: iterate safely in notebooks and run the same logic in production.
  • Built-in structure for automation: lineage discovery, CLI execution, and plugin-based extension points.

Documentation

For full setup, concepts, and end-to-end examples, see the documentation.

Minimal Example

The docs include complete runnable examples. This is a minimal sketch of what a tidylake data product looks like.

Create a manifest (silver_customers.yml):

```yaml
data_product:
  name: silver_customers
  description: Customer profile from CRM
  script: silver_customers
  schema:
    properties:
      customer_id:
        type: string
      customer_name:
        type: string
```
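The `schema.properties` block mirrors JSON Schema, so the manifest can act as a machine-readable contract. As a rough illustration of how such a manifest could drive a column check (a hand-rolled sketch using PyYAML and pandas, not part of the tidylake API), consider:

```python
import yaml
import pandas as pd

# A trimmed copy of the manifest above, inlined for the example.
MANIFEST = """
data_product:
  name: silver_customers
  schema:
    properties:
      customer_id:
        type: string
      customer_name:
        type: string
"""

# Parse the manifest and pull out the declared column names.
properties = yaml.safe_load(MANIFEST)["data_product"]["schema"]["properties"]

def missing_columns(df: pd.DataFrame) -> list[str]:
    """Return the manifest columns that are absent from the frame."""
    return [col for col in properties if col not in df.columns]

df = pd.DataFrame({"customer_id": ["c-1"], "customer_name": ["Ada"]})
print(missing_columns(df))  # → []
```

In tidylake itself this contract handling is managed by the framework; the sketch only shows why a metadata-first manifest is useful as a single source of truth.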

Link it to a script (silver_customers.py):

```python
import pandas as pd
from tidylake import get_or_create_context

# Bind this script to the data product declared in the manifest.
product = get_or_create_context().get_data_product("silver_customers")

@product.add_input()
def bronze_customers():
    return pd.read_parquet("/tmp/bronze_customers")

# Keep only the columns declared in the manifest schema.
df = bronze_customers()[["customer_id", "customer_name"]]

@product.set_sink()
def write_silver_customers():
    df.to_parquet("/tmp/silver_customers", index=False)
```

Then use the CLI:

```shell
tidylake list
tidylake run
```

You can extend tidylake with plugins to integrate storage, compute, and catalog services from your existing stack.

Contributing

See CONTRIBUTING.md for development setup and contribution guidelines.

License

This project is open source, released under the Apache License, and brought to you by the Taidy team.
