Thanks for your interest in Raincloud. This guide covers how to set up a dev
environment, run the test suite, and submit changes. For deeper dives into the
pipeline itself, see README.md, AGENTS.md, and
SKILLS.md.
git clone git@github.com:spiraldb/raincloud.git
cd raincloud
uv sync --extra dev

--extra dev pulls in pytest. Add --extra kaggle or --extra huggingface
if your work touches those upstream types, or --extra all for everything.
Three sub-second checks are the minimum gate (CI runs all three):
ruff check # lint (pyflakes + pycodestyle + isort)
python -m scripts.pipeline.validate_manifest # JSON Schema + cross-checks on sources.json
pytest # smoke regression net (manifest, schema, registry, examples)

If you touched the build pipeline, also run a small end-to-end build to make sure it still produces the expected output:
python -m scripts.pipeline.build countries-of-the-world # ~200 ms, 227 rows

For larger builds, see SKILLS.md.
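
The gate checks above can be chained in a small helper so that the first failure stops the run. This is only a sketch and not part of the repository: the names GATE_COMMANDS and run_gate and the fail-fast behaviour are assumptions; the commands themselves are the three listed above.

```python
import subprocess
import sys

# Commands copied from the checklist above; the wrapper itself is hypothetical.
# sys.executable stands in for `python` so the active environment is reused.
GATE_COMMANDS = [
    ["ruff", "check"],
    [sys.executable, "-m", "scripts.pipeline.validate_manifest"],
    ["pytest"],
]


def run_gate(commands=GATE_COMMANDS, runner=subprocess.run):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in commands:
        print("$", " ".join(cmd))
        result = runner(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Calling sys.exit(run_gate()) from the repository root would then mirror the manual sequence. The runner parameter exists only so the helper can be exercised without invoking the real tools.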
- New datasets: see SKILLS.md. Most entries copy examples/minimal_spec.json and pick an existing handler from docs/v1/handlers.md.
- New transform handlers: see SKILLS.md. One handler per upstream shape; register in scripts/pipeline/handlers/__init__.py.
- Bug fixes: start with a failing test where practical.
- Documentation: README/AGENTS/SKILLS edits welcome. The two derived docs (docs/datasets.md, docs/handlers.md) are machine-generated; don't hand-edit them. Instead, fix the manifest or the registry and regenerate via python -m scripts.pipeline.docs.
Add a test alongside any new behaviour:
- New transform handler: a fixture-based test demonstrating the expected output shape (small in-memory pa.Table; see existing handler tests in tests/test_manifest.py for the pattern).
- New manifest field or schema rule: extend test_manifest.py to assert it validates as expected.
- New CLI flag: extend the relevant test_*.py (e.g. test_list_datasets.py for catalog-filter flags).
- Bug fix: a failing test that the fix turns green.
pytest is the minimum pre-PR gate (see Before you open a PR);
CI re-runs it on every PR via .github/workflows/ci.yml.
- Branch off develop. Branch names follow <initials>/<topic> (e.g. mp/add-fastlanes).
- Open PRs against develop.
- Commit messages: short imperative subject ("add X", "fix Y", "swap Z to W"), optional body explaining why the change is needed.
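
For a quick local check of the <initials>/<topic> naming convention, a small regex is enough. The sketch below is not part of the repository; the exact pattern (two to four lowercase initials, then a lowercase hyphen-separated topic) is an assumption inferred from the mp/add-fastlanes example, and the project may accept other shapes.

```python
import re

# Assumed shape: 2-4 lowercase initials, a slash, then a hyphen-separated topic.
BRANCH_RE = re.compile(r"[a-z]{2,4}/[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_branch(name: str) -> bool:
    """Return True if the branch name matches the assumed <initials>/<topic> shape."""
    return BRANCH_RE.fullmatch(name) is not None
```

Under these assumptions, is_valid_branch("mp/add-fastlanes") is accepted, while a bare topic such as "add-fastlanes" is rejected.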
Open an issue on GitHub Issues. Include the slug you were building, the command you ran, and any traceback.
For security-related issues, do not open a public issue; see
SECURITY.md for the private channel.
- Python ≥ 3.11. Match the style of nearby code; the repo prefers terse, comment-light Python with explicit names over abstractions.
- No backwards-compat stubs or shims when removing handlers/slugs; git history is the fallback.
- Always go through scripts.pipeline.spec.duckdb_connect for DuckDB connections so resource limits and storage_compatibility_version=v1.5.0 apply (see AGENTS.md).
By submitting a PR, you agree that your contribution will be licensed under the Apache License 2.0, the same license that covers the rest of the project.