Guidance for AI coding agents (Claude Code, Cursor, etc.) working in this repo. Read this before making non-trivial changes: the repo's layout is intentional, and the core invariants below are easy to violate accidentally.
For the user-facing overview, see README.md. For the manifest schema, see sources.schema.md. The repo root also has a CLAUDE.md → AGENTS.md symlink so Claude Code auto-loads this file; both names point at the same content.
On a fresh clone, `outputs/` is empty; that's expected. The `outputs/v1/<slug>/` directories are gitignored and only populated by builds. Verify your environment with the read-only smoke test before doing anything heavy:
```
python -m scripts.pipeline.status --fast --missing-only
```

It loads `sources.json`, walks the manifest, and prints per-slug filesystem state in seconds with no side effects. If it errors, fix the environment (`uv sync`) before running any build.
For a manifest sanity check that doesn't touch the filesystem at all:
```
python -m scripts.pipeline.validate_manifest
```

Validates `sources.json` against `sources.schema.json` (Draft 2020-12) plus cross-checks the schema can't express: handler-name resolution against the live registry, slug uniqueness, `fetch.type`/`fetch.auth` consistency. Sub-second; safe to invoke after any manifest edit.
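If you need the schema half of that check from inside another script, it's plain jsonschema. A minimal sketch, assuming the `jsonschema` package is available; the extra cross-checks (handler resolution, slug uniqueness, fetch consistency) are not reproduced here, so prefer the real command:

```python
import json
from pathlib import Path

from jsonschema import Draft202012Validator  # assumed available; validate_manifest does a Draft 2020-12 check

manifest = json.loads(Path("sources.json").read_text())
schema = json.loads(Path("sources.schema.json").read_text())

# Schema-shape errors only; run `python -m scripts.pipeline.validate_manifest` for the full check.
errors = list(Draft202012Validator(schema).iter_errors(manifest))
for err in errors:
    print(f"{'/'.join(str(p) for p in err.path)}: {err.message}")
print("schema OK" if not errors else f"{len(errors)} schema error(s)")
```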
For catalog queries that would otherwise require grepping the ~545 KB `sources.json` (or scrolling ~158 KB of `docs/v1/datasets.md`):
```
python -m scripts.pipeline.list_datasets --family uci --count
python -m scripts.pipeline.list_datasets --handler tighten_types --long
python -m scripts.pipeline.list_datasets --fetch-type kaggle --kaggle-tos
python -m scripts.pipeline.list_datasets --grep '\bgeo' --long
```

Filters compose with AND across `--family`, `--handler`, `--license`, `--fetch-type`, `--reader`, `--vortex` / `--no-vortex`, `--kaggle-tos`, `--grep`. Output modes: default (one slug per line), `--long` (wide table), `--json` (jq-friendly), `--count`.
If the user wants to browse interactively rather than query, point them at `python -m scripts.pipeline.browse` (read-only Textual TUI over the same data; requires `uv sync --extra tui`). It's a human-facing tool; don't try to run it from an agent context, since it won't render and will hang waiting for keystrokes.
For a slightly broader regression net, the tests/ directory carries a sub-second pytest smoke suite (manifest shape, schema self-consistency, handler registry, example template). Run it after any change to the manifest, the schema, or the handler registry:
```
uv sync --extra dev   # one-time; installs pytest
pytest
```

Copy-pasteable templates for the two most common edits live under `examples/`: `minimal_spec.json` for new manifest entries, `streaming_handler.py.tmpl` for memory-constrained transform handlers.
If your agent harness supports the Agent Skills standard (Claude Code, Codex, etc.), the `.agents/skills/` directory carries 16 invokable skills wrapping every pipeline entry point and procedural playbook; see `.agents/skills/README.md`. `.claude` → `.agents` is a symlink so both naming conventions resolve. The `.agents/settings.json` at the same level is a tracked allow-list of safe, read-only commands so a fresh-clone agent doesn't burn turns on permission prompts; per-machine overrides go in the gitignored `.agents/settings.local.json`.
`docs/v1/datasets.md` (~158 KB) is large; prefer targeted reads via offset/limit. Column-level / coverage / vortex-skip / hydrate-candidate views are no longer markdown. They used to be (huge `columns_*.md` and `coverage_*.md` files, plus per-slug `vortex_skip.md` and `hydrated.md` listings), but those were unscannable as a reading experience and duplicated state that's already queryable. They're now flags on `list_datasets`:
```
python -m scripts.pipeline.list_datasets --columns [<slug>...] [--column-grep PATTERN]
python -m scripts.pipeline.list_datasets --coverage [--source parquet|vortex]
python -m scripts.pipeline.list_datasets --no-vortex --long   # vortex-opted-out slugs + reasons (via --json)
python -m scripts.pipeline.list_datasets --hydrate --long     # hydration candidates
```

Hydration policy / philosophy lives in the hand-maintained HYDRATING.md (preamble only, no auto-generated per-slug list). For catalog-shape questions ("which slugs use handler X", "what's CC0-licensed"), prefer `list_datasets`. The top-level `docs/*.md` mirrors are gitignored scratch and behave identically.
`docs/v1/handlers.md` (small, ~3 KB) is fine to read in full and carries one row per registered handler with purpose, streaming flag, the format-specific deps it imports (pandas, openpyxl, pyreadstat, osmium, zstandard, unlzw3; pyarrow / numpy / duckdb are suppressed as core), manifest spec count, and example slugs. Read it before adding a new handler so you can pick precedent and know which extras the manifest entry will need.
Raincloud is a client-reproducible pipeline for building a curated catalog of public datasets as Parquet + optional Vortex files. The single source of truth is `sources.json`. Everything under `outputs/`, the two derived docs (`docs/datasets.md`, `docs/handlers.md`), and the JSON catalog snapshot (`docs/snapshot.json`, read by the TUI as a fallback for slugs not built locally, and used by `docs.py` itself as the row-count / file-size fallback when regenerating `datasets.md` on a partial build) is derived: regenerate, never hand-edit. Column-level / coverage / vortex-skip / hydrate-candidate views are queryable via `list_datasets` flags rather than markdown.
The pipeline flow is: fetch → extract → parse → transform → write → validate → convert (stage 7, opt-in per spec), orchestrated by `scripts.pipeline.build`.
- `sources.json` is authoritative. Every row of every derived artefact maps back to a `DatasetSpec` here. If you're tempted to hand-edit `docs/*.md` or drop a parquet into `outputs/v1/<slug>/` by hand: stop, fix the manifest, re-run the build, re-run `docs.py`.
- `outputs/raw_downloads/<slug>/` is unversioned; `outputs/v{schema_version}/<slug>/<format>/<filename>` is version-scoped. Raw upstream bytes are the same regardless of output schema_version, so they're cached outside the version prefix. Within a version, artefacts live under per-format subdirectories: today `parquet/<slug>.parquet` and `vortex/<slug>.vortex`, with room for `parquet-hydrated/`, partitioned variants, etc. without filename collisions. Path helpers in `scripts/pipeline/spec.py`: `output_format_dir(slug, fmt)`, `prepared_parquet(slug)`, `prepared_vortex(slug)`. A manifest bump to v2 would populate `outputs/v2/` alongside `outputs/v1/`, both sharing `raw_downloads/`.
- `_workdir/<slug>/` is scratch. Gitignored and safe to wipe. Handlers should clean up what they put there; `build.py --clean-workdir` also wipes after a successful build.
- `.archive/` is gitignored and local-only. Holds Kaggle-era triage/attribution docs kept on the maintainer's tree for reference. A fresh-clone agent won't have this directory; when other docs reference it as a "fallback" alongside git history, treat git history as the only fallback you can rely on.
- Always go through `spec.duckdb_connect` when opening a DuckDB connection, not `duckdb.connect(...)` directly. The helper applies env-var-driven resource limits and the `storage_compatibility_version=v1.5.0` setting required for persistent VARIANT writes. See SKILLS.md for detail. A usage sketch combining this with the path helpers follows this list.
- `docs/` layout is split. Top-level `docs/*.md` is gitignored scratch, regenerable against a subset of parquets for local type-coverage experiments. `docs/v{schema_version}/*.md` is the tracked canonical snapshot matching `outputs/v{n}/`. Regenerating docs via `scripts.pipeline.docs` writes to the top-level path; promotion to `docs/v{n}/` is a manual copy.
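A minimal sketch of the intended call pattern. The exact signatures (a no-argument `spec.duckdb_connect()`, helpers returning `Path` objects) are assumptions, so check `scripts/pipeline/spec.py` before leaning on them:

```python
from scripts.pipeline import spec

slug = "target-slug"  # hypothetical slug, for illustration only

# Resolve version-scoped artefact paths through the helpers; never build the strings by hand.
pq = spec.prepared_parquet(slug)   # .../v{schema_version}/<slug>/parquet/<slug>.parquet
vx = spec.prepared_vortex(slug)    # .../v{schema_version}/<slug>/vortex/<slug>.vortex

# Always the wrapper, never duckdb.connect(): it applies the resource limits and the
# storage_compatibility_version=v1.5.0 setting needed for persistent VARIANT writes.
con = spec.duckdb_connect()
if pq.exists():
    rows = con.execute(f"SELECT count(*) FROM read_parquet('{pq}')").fetchone()[0]
    print(f"{slug}: {rows} rows in {pq}")
else:
    print(f"{slug}: not built locally (vortex present: {vx.exists()})")
```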
The seven stages are in scripts/pipeline/ and are each independently invokable:
| Stage | Module | Reads | Writes |
|---|---|---|---|
| fetch | `fetch.py` | `fetch.*` | `outputs/raw_downloads/<slug>/` |
| extract | `extract.py` | `extract.*` | `_workdir/<slug>/` |
| parse | `parse.py` | `parse.*` | in-memory `(Path, Table)` tuples |
| transform | `transform.py` | `transform.*` | in-memory `(slug, Table)` tuples, or direct-to-parquet (streaming handlers) |
| write | `write.py` | `write.*` | `outputs/v{n}/<slug>/parquet/<slug>.parquet` |
| validate | `validate.py` | `expect.*` | raises on mismatch unless `--loose` |
| convert | `convert.py` | `convert.*` | `outputs/v{n}/<slug>/vortex/<slug>.vortex` (when `convert.vortex = true`); also `outputs/v{n}/<slug>/vortex-hydrated/<slug>.vortex` when a hydrated parquet exists (the same flag governs both) |
| hydrate (opt-in, off the default build path) | `hydrate.py` | `hydrate.*` | `outputs/v{n}/<slug>/parquet-hydrated/<slug>.parquet` (only when `hydrate` is set; safety-filter-gated; outbound HTTP); auto-runs convert at the end when `convert.vortex = true` |
Streaming handlers (`factbook_variant_parse`, `jsonbench_variant_parse`, `wikipedia_variant_parse`, `lichess_pgn_parse`, `stack_exchange_split`, `osm_pbf_split`, `public_bi_merge`) write the parquet themselves and return `[]`; the write stage becomes a no-op.
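If you're writing a new streaming handler, start from `examples/streaming_handler.py.tmpl`. The contract it has to honour looks roughly like this; the handler signature and the exact DuckDB incantation below are assumptions for illustration, not the template's real contents:

```python
from pathlib import Path

from scripts.pipeline import spec


def my_streaming_parse(raw_dir: Path, slug: str) -> list:
    """Hypothetical streaming handler: writes its parquet directly, returns []."""
    out = spec.prepared_parquet(slug)          # version-scoped output path via the helper
    out.parent.mkdir(parents=True, exist_ok=True)

    con = spec.duckdb_connect()                # resource limits + VARIANT setting applied here
    # Stream from raw files straight to parquet instead of materialising a Table in memory.
    con.execute(
        f"COPY (SELECT * FROM read_json_auto('{raw_dir}/*.json')) TO '{out}' (FORMAT parquet)"
    )

    return []  # empty return tells the write stage it has nothing to do
```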
Some fetch.type: "kaggle" entries carry fetch.requires_interactive_accept: true β those datasets are gated behind a one-time click-through ToS acceptance on the Kaggle web UI and can't be built on a fresh Kaggle account without that manual step. See SKILLS.md for the pattern.
The manifest is a large hand-authored JSON file (~545 KB, 249 dataset entries) with a specific top-level key order (schema_version, generated_at, audit_cutoff, notes, datasets). Stick with small Python scripts for edits:
```python
import json
from pathlib import Path

SRC = Path("sources.json")  # run from the repo root, or use an absolute path

m = json.loads(SRC.read_text())
for d in m["datasets"]:
    if d["slug"] == "target-slug":
        d["transform"]["handler"] = "new_handler"
        break
SRC.write_text(json.dumps(m, indent=2) + "\n")
```

Don't use sed or text-based edits; JSON-safe structural edits are cheap and avoid accidental quoting breakage.
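The same structural-edit pattern covers adding a new entry from the template. A sketch, assuming `examples/minimal_spec.json` holds a single complete entry (verify the filled-in fields against sources.schema.md before committing):

```python
import json
from pathlib import Path

SRC = Path("sources.json")
TEMPLATE = Path("examples/minimal_spec.json")

m = json.loads(SRC.read_text())
entry = json.loads(TEMPLATE.read_text())
entry["slug"] = "my-new-dataset"   # hypothetical slug; fill in the remaining template fields too

m["datasets"].append(entry)        # append structurally so top-level key order stays intact
SRC.write_text(json.dumps(m, indent=2) + "\n")
# Follow up with: python -m scripts.pipeline.validate_manifest
```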
Building a single large dataset can take hours (observed: JSONBench 100M: 6 h; OSM Germany extract: 45 min per element kind; Wikipedia Structured Contents: 34 GB parquet, multi-hour). The `outputs/v1/<slug>/parquet/<slug>.parquet` + `vortex/<slug>.vortex` pair on disk already reflects a full catalog build; rebuilding wipes and redoes that work. Before running `python -m scripts.pipeline.build <slug>` on anything non-trivial, confirm with the user.
Small (<100 MB) parquets are fine to rebuild without asking.
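To check which side of that line a built slug falls on, a small sketch (assumes `prepared_parquet` resolves the path as described in the layout notes above):

```python
from scripts.pipeline import spec

slug = "target-slug"  # hypothetical
pq = spec.prepared_parquet(slug)
if pq.exists():
    mb = pq.stat().st_size / 1_000_000
    print(f"{slug}: {mb:.0f} MB -> {'small, rebuild freely' if mb < 100 else 'large, ask the user first'}")
else:
    print(f"{slug}: not built locally; check list_datasets --long or docs/snapshot.json for expected size")
```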
```
python -m scripts.pipeline.docs   # datasets.md + handlers.md + snapshot.json
```

All three derived artefacts regenerate in one pass by default. Run this after any build, convert run, in-place parquet mutation, or when a handler is added/removed/renamed (handlers.md regenerates from the registry + manifest).
Keep `docs/snapshot.json` fresh; it's load-bearing. `datasets.md` regen reads from disk for slugs you've built locally and falls back to `docs/snapshot.json` (or `docs/v{schema_version}/snapshot.json` on a fresh clone) for everything else. Without that fallback, regenerating on a partial build would dash-out 200+ rows and silently destroy ground truth in the tracked snapshot. The default no-args invocation regens snapshot + datasets in lockstep, so it's only at risk if you do partial regens (`docs.py datasets` alone won't refresh the snapshot). After a build, prefer the no-args form.
- No Kaggle-era narrative. The legacy triage / binary-blob-integration / Kaggle-filter history lives in `.archive/`. New README/AGENTS/SKILLS content should reflect only the current three-point intent: fetch → transform → outputs.
- One handler per upstream shape. Don't shoehorn a new shape into `tighten_types` or `identity`; write a dedicated handler under `scripts/pipeline/handlers/` and register it in `handlers/__init__.py` (see the sketch after this list).
- Handlers are short. Most are under 150 lines. If a new handler balloons past that, look for reuse opportunities with existing helpers (`duckdb_connect`, `outputs_root`, `spec_field`).
- No backwards-compat stubs. When removing a handler or slug, remove it fully; git history (and the maintainer's local `.archive/`) is the fallback, not half-wired shims.
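A hedged sketch of the add-a-handler motion referenced in the list above; the handler signature and the registry shape in `handlers/__init__.py` are assumptions, so copy whatever an existing short handler actually does:

```python
# scripts/pipeline/handlers/my_new_shape.py  (hypothetical module)
import pyarrow as pa


def my_new_shape(tables: list[tuple[str, pa.Table]]) -> list[tuple[str, pa.Table]]:
    """Hypothetical non-streaming handler: consume parsed (slug, Table) tuples, return cleaned ones."""
    cleaned = []
    for slug, table in tables:
        # shape-specific cleanup goes here (column renames, type tightening, etc.)
        cleaned.append((slug, table))
    return cleaned


# scripts/pipeline/handlers/__init__.py  (assumed to expose a name -> callable registry)
# from .my_new_shape import my_new_shape
# HANDLERS["my_new_shape"] = my_new_shape
```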
Prefer Read → Grep → ask over guessing. The pipeline has hidden contracts (streaming handlers returning `[]`, `raw_downloads` being unversioned, VARIANT requiring `storage_compatibility_version`) that aren't obvious from any single file.