Releases: spiraldb/raincloud

v0.1.1 - convert streaming, docs/snapshot fallback

07 May 20:04
1575c06


Added

  • README badges (CI status, latest release, license, citation).

Changed

  • Convert stage now streams parquet batches via pf.iter_batches() →
    RecordBatchReader → vxio.write instead of materialising whole tables.
    Resolves pyarrow's ArrowNotImplementedError ("Nested data conversions not
    implemented for chunked array outputs") on slugs whose nested columns
    (list<struct>, struct<bytes, ...>) would need to be chunked across
    multiple Arrow arrays. Re-enables Vortex output for osm-germany-ways,
    ultrachat-200k, mmmu, websight-v01, peoples-speech-clean-validation. A
    hedged sketch of the streaming path follows this list.
  • code-contests Vortex skip re-diagnosed: not the chunked-array path, but a
    separate upstream FSST i32-offset overflow on list<string> columns >2 GB.
  • open-food-facts description aligned with shipped output (currently a
    single raw_json: string column via jsonl_as_string_parse; VARIANT
    promotion deferred).
  • PR template: dropped the "Test plan" checklist (CI runs the same gates on
    every PR; CONTRIBUTING.md documents them once).
  • Agent-tooling docs (AGENTS.md, SKILLS.md, raincloud-docs skill) now flag
    docs/snapshot.json as load-bearing: it backs the TUI fallback and the
    row-count / file-size fallback for datasets.md regen. The stale "six
    derived docs" reference in AGENTS.md was corrected to three.

Fixed

  • docs/datasets.md regeneration now falls back to docs/snapshot.json
    (top-level scratch, then docs/v{schema_version}/snapshot.json on a fresh
    clone) for slugs whose parquet isn't built locally. Previously, a
    partial-build regen would silently dash out row counts and file sizes for
    any slug not on disk, destroying ground truth in the v1 snapshot. Snapshot
    regen now also captures last_built_row_groups. Five regression tests
    added in tests/test_docs.py. A hedged sketch of the fallback follows.
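
A hedged sketch of the fallback order described above; the helper and the snapshot field names are illustrative assumptions, not the repo's actual API.

```python
# Illustrative only: prefer the locally built parquet, otherwise fall back to
# the snapshot ground truth instead of dashing out the value.
import json
from pathlib import Path

import pyarrow.parquet as pq


def row_count_for(slug: str, parquet_path: Path, schema_version: int) -> int | None:
    if parquet_path.exists():
        return pq.ParquetFile(parquet_path).metadata.num_rows
    # Fallback order from the notes: top-level scratch snapshot first, then
    # the versioned copy that exists on a fresh clone.
    for candidate in (Path("docs/snapshot.json"),
                      Path(f"docs/v{schema_version}/snapshot.json")):
        if candidate.exists():
            entry = json.loads(candidate.read_text()).get(slug)  # assumed slug-keyed layout
            if entry is not None:
                return entry.get("row_count")  # assumed field name
    return None  # no local parquet and no snapshot entry: leave the gap
```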

v0.1.0 - initial public release

06 May 03:50


Initial public release.

Raincloud is a client-reproducible pipeline for building a curated catalog
of public datasets as analytics-ready Parquet + Vortex files. See
README.md for the user-facing overview,
AGENTS.md for the architecture, and
SKILLS.md for procedural playbooks.

This release bundles:

  • The 7-stage build pipeline (fetch → extract → parse → transform → write
    → validate → convert) plus the opt-in hydrate stage. The stage order is
    sketched after this list.
  • 249 dataset specs across 5 families (direct, kaggle-upstream,
    nyc-tlc, public-bi, uci).
  • 24 named transform handlers covering CSV / Parquet / JSONL / XML / PBF /
    custom-format upstreams plus streaming variants for memory-constrained
    shapes.
  • A read-only Textual TUI for browsing the catalog
    (python -m scripts.pipeline.browse, requires --extra tui).
  • Per-dataset Vortex conversion via the convert.vortex flag.
  • Apache License 2.0, with SPDX file headers on all Python sources.
  • Governance: SECURITY.md, CONTRIBUTING.md, CODE_OF_CONDUCT.md
    (Contributor Covenant 2.1), DISCLAIMER.md (AS IS posture, content
    and license disclaimers, dataset-removal reporting), and
    HYDRATING.md (policy for the optional hydrate stage).
  • Tooling: ruff lint (rules E, F, W, I) + GitHub Actions CI
    (.github/workflows/ci.yml) running lint, manifest validation, and
    pytest on every push and PR to develop.
  • Dataset-removal issue template
    (.github/ISSUE_TEMPLATE/dataset-removal.yml): a structured form for the
    reporting channel that DISCLAIMER.md points readers to.
  • Pull-request template (.github/pull_request_template.md) prompting for a
    summary, a test-plan checkbox list against the standard pre-PR gate, and
    change-type tags.
  • CITATION.cff: GitHub-native citation metadata; surfaces the "Cite
    this repository" button in the repo sidebar with BibTeX / APA / Chicago
    exports.
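
An illustrative sketch of the stage order named in the first bullet above; the runner below is hypothetical and mirrors only the stage names, not the actual code under scripts/pipeline.

```python
# Illustrative only: stage names come from these notes; the runner and the
# position of the opt-in hydrate stage are assumptions, not the repo's API.
STAGES = ("fetch", "extract", "parse", "transform", "write", "validate", "convert")
OPTIONAL_STAGES = ("hydrate",)  # opt-in; where it slots in is not stated here


def run_stages(slug: str, hydrate: bool = False) -> None:
    for stage in STAGES + (OPTIONAL_STAGES if hydrate else ()):
        print(f"[{slug}] running stage: {stage}")
```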