Releases: spiraldb/raincloud
v0.1.1: convert streaming, docs/snapshot fallback
Added
- README badges (CI status, latest release, license, citation).
Changed
- Convert stage now streams parquet batches via
  pf.iter_batches() → RecordBatchReader → vxio.write instead of materialising
  whole tables. Resolves "ArrowNotImplementedError: Nested data conversions
  not implemented for chunked array outputs" from pyarrow on slugs whose
  nested columns (list<struct>, struct<bytes,…>) would need to be chunked
  across multiple Arrow arrays. Re-enables Vortex output for
  osm-germany-ways, ultrachat-200k, mmmu, websight-v01,
  peoples-speech-clean-validation.
- code-contests Vortex skip re-diagnosed: not the chunked-array path; a
  separate upstream FSST i32-offset overflow on list<string> > 2 GB.
- open-food-facts description aligned with shipped output (currently a
  single raw_json: string column via jsonl_as_string_parse; VARIANT
  promotion deferred).
- PR template: dropped the "Test plan" checklist (CI runs the same gates on
  every PR; CONTRIBUTING.md documents them once).
- Agent-tooling docs (AGENTS.md, SKILLS.md, raincloud-docs skill) now flag
  docs/snapshot.json as load-bearing: it backs the TUI fallback and the
  row-count / file-size fallback for datasets.md regen. Stale "six derived
  docs" reference in AGENTS.md cleaned up to three.
Fixed
- docs/datasets.md regeneration now falls back to docs/snapshot.json
  (top-level scratch, then docs/v{schema_version}/snapshot.json on a fresh
  clone) for slugs whose parquet isn't built locally. Previously,
  partial-build regen would silently dash out row counts and file sizes for
  any slug not on disk, destroying ground truth in the v1 snapshot. Snapshot
  regen now also captures last_built_row_groups. Five regression tests
  added in tests/test_docs.py.
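The fallback order described in this fix can be sketched roughly as below. This is an illustration under stated assumptions, not the project's code: the function names, the snapshot layout (a mapping from slug to a dict), and the "rows" key are all hypothetical.

```python
import json
from pathlib import Path


def load_snapshot(docs_dir: Path, schema_version: int) -> dict:
    """Return the first snapshot found: top-level scratch copy first,
    then the versioned copy that exists on a fresh clone."""
    candidates = [
        docs_dir / "snapshot.json",
        docs_dir / f"v{schema_version}" / "snapshot.json",
    ]
    for path in candidates:
        if path.is_file():
            return json.loads(path.read_text(encoding="utf-8"))
    return {}


def row_count_for(slug: str, local_counts: dict, snapshot: dict):
    """Prefer a count from a locally built parquet file; otherwise reuse
    the snapshot value instead of blanking it out."""
    if slug in local_counts:
        return local_counts[slug]
    # "rows" is an assumed key name; None means the count is unknown.
    return snapshot.get(slug, {}).get("rows")
```

The point of the ordering is that a partial local build only overrides the slugs it actually produced, so previously recorded values survive a regen.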
v0.1.0: initial public release
Initial public release.
Raincloud is a client-reproducible pipeline for building a curated catalog
of public datasets as analytics-ready Parquet + Vortex files. See
README.md for the user-facing overview,
AGENTS.md for the architecture, and
SKILLS.md for procedural playbooks.
This release bundles:
- The 7-stage build pipeline (fetch → extract → parse → transform → write
  → validate → convert) plus the opt-in hydrate stage.
- 249 dataset specs across 5 families (direct, kaggle-upstream, nyc-tlc,
  public-bi, uci).
- 24 named transform handlers covering CSV / Parquet / JSONL / XML / PBF /
  custom-format upstreams plus streaming variants for memory-constrained
  shapes.
- A read-only Textual TUI for browsing the catalog
  (python -m scripts.pipeline.browse, requires --extra tui).
- Per-dataset Vortex conversion via the convert.vortex flag.
- Apache License 2.0, with SPDX file headers on all Python sources.
- Governance: SECURITY.md, CONTRIBUTING.md, CODE_OF_CONDUCT.md
  (Contributor Covenant 2.1), DISCLAIMER.md (AS IS posture, content
  and license disclaimers, dataset-removal reporting), and
  HYDRATING.md (policy for the optional hydrate stage).
- Tooling: ruff lint (rules E, F, W, I) + GitHub Actions CI
  (.github/workflows/ci.yml) running lint, manifest validation, and
  pytest on every push and PR to develop.
- Dataset-removal issue template
  (.github/ISSUE_TEMPLATE/dataset-removal.yml): structured form for
  the channel DISCLAIMER.md points readers at.
- Pull-request template (.github/pull_request_template.md) prompting
  for summary, test-plan checkbox list against the standard pre-PR gate,
  and change-type tags.
- CITATION.cff: GitHub-native citation metadata; surfaces the "Cite
  this repository" button in the repo sidebar with BibTeX / APA / Chicago
  exports.