Releases: spiraldb/raincloud

v0.1.1 - convert streaming, docs/snapshot fallback

07 May 20:04
1575c06


Added

  • README badges (CI status, latest release, license, citation).

Changed

  • Convert stage now streams parquet batches via pf.iter_batches() →
    RecordBatchReader → vxio.write instead of materialising whole tables.
    Resolves pyarrow's ArrowNotImplementedError ("Nested data conversions not
    implemented for chunked array outputs") on slugs whose nested columns
    (list<struct>, struct<bytes, ...>) would need to be chunked across
    multiple Arrow arrays. Re-enables Vortex output for osm-germany-ways,
    ultrachat-200k, mmmu, websight-v01, peoples-speech-clean-validation. A
    hedged sketch of the streaming path follows this list.
  • code-contests Vortex skip re-diagnosed: not the chunked-array path, but a
    separate upstream FSST i32-offset overflow on list<string> columns >2 GB.
  • open-food-facts description aligned with shipped output (currently a
    single raw_json: string column via jsonl_as_string_parse; VARIANT
    promotion deferred).
  • PR template: dropped the "Test plan" checklist (CI runs the same gates on
    every PR; CONTRIBUTING.md documents them once).
  • Agent-tooling docs (AGENTS.md, SKILLS.md, raincloud-docs skill) now flag
    docs/snapshot.json as load-bearing: it backs the TUI fallback and the
    row-count / file-size fallback for datasets.md regen. The stale "six
    derived docs" reference in AGENTS.md was corrected to three.

Fixed

  • docs/datasets.md regeneration now falls back to docs/snapshot.json
    (top-level scratch, then docs/v{schema_version}/snapshot.json on a fresh
    clone) for slugs whose parquet isn't built locally. Previously, a
    partial-build regen would silently dash out row counts and file sizes for
    any slug not on disk, destroying ground truth in the v1 snapshot. Snapshot
    regen now also captures last_built_row_groups. Five regression tests
    added in tests/test_docs.py. A hedged sketch of the fallback follows.
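
A hedged sketch of the fallback order described above; the helper and the snapshot field names are illustrative assumptions, not the repo's actual API.

```python
# Illustrative only: prefer the locally built parquet, otherwise fall back to
# the snapshot ground truth instead of dashing out the value.
import json
from pathlib import Path

import pyarrow.parquet as pq


def row_count_for(slug: str, parquet_path: Path, schema_version: int) -> int | None:
    if parquet_path.exists():
        return pq.ParquetFile(parquet_path).metadata.num_rows
    # Fallback order from the notes: top-level scratch snapshot first, then
    # the versioned copy that exists on a fresh clone.
    for candidate in (Path("docs/snapshot.json"),
                      Path(f"docs/v{schema_version}/snapshot.json")):
        if candidate.exists():
            entry = json.loads(candidate.read_text()).get(slug)  # assumed slug-keyed layout
            if entry is not None:
                return entry.get("row_count")  # assumed field name
    return None  # no local parquet and no snapshot entry: leave the gap
```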

v0.1.0 - initial public release

06 May 03:50


Initial public release.

Raincloud is a client-reproducible pipeline for building a curated catalog
of public datasets as analytics-ready Parquet + Vortex files. See
README.md for the user-facing overview,
AGENTS.md for the architecture, and
SKILLS.md for procedural playbooks.

This release bundles:

  • The 7-stage build pipeline (fetch → extract → parse → transform → write
    → validate → convert) plus the opt-in hydrate stage. The stage order is
    sketched after this list.
  • 249 dataset specs across 5 families (direct, kaggle-upstream,
    nyc-tlc, public-bi, uci).
  • 24 named transform handlers covering CSV / Parquet / JSONL / XML / PBF /
    custom-format upstreams plus streaming variants for memory-constrained
    shapes.
  • A read-only Textual TUI for browsing the catalog
    (python -m scripts.pipeline.browse, requires --extra tui).
  • Per-dataset Vortex conversion via the convert.vortex flag.
  • Apache License 2.0, with SPDX file headers on all Python sources.
  • Governance: SECURITY.md, CONTRIBUTING.md, CODE_OF_CONDUCT.md
    (Contributor Covenant 2.1), DISCLAIMER.md (AS IS posture, content
    and license disclaimers, dataset-removal reporting), and
    HYDRATING.md (policy for the optional hydrate stage).
  • Tooling: ruff lint (rules E, F, W, I) + GitHub Actions CI
    (.github/workflows/ci.yml) running lint, manifest validation, and
    pytest on every push and PR to develop.
  • Dataset-removal issue template
    (.github/ISSUE_TEMPLATE/dataset-removal.yml): a structured form for the
    reporting channel that DISCLAIMER.md points readers to.
  • Pull-request template (.github/pull_request_template.md) prompting for a
    summary, a test-plan checkbox list against the standard pre-PR gate, and
    change-type tags.
  • CITATION.cff: GitHub-native citation metadata; surfaces the "Cite
    this repository" button in the repo sidebar with BibTeX / APA / Chicago
    exports.
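
An illustrative sketch of the stage order named in the first bullet above; the runner below is hypothetical and mirrors only the stage names, not the actual code under scripts/pipeline.

```python
# Illustrative only: stage names come from these notes; the runner and the
# position of the opt-in hydrate stage are assumptions, not the repo's API.
STAGES = ("fetch", "extract", "parse", "transform", "write", "validate", "convert")
OPTIONAL_STAGES = ("hydrate",)  # opt-in; where it slots in is not stated here


def run_stages(slug: str, hydrate: bool = False) -> None:
    for stage in STAGES + (OPTIONAL_STAGES if hydrate else ()):
        print(f"[{slug}] running stage: {stage}")
```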