- Metadata extraction (C++17 parser)
- Experimental column decoder
- Encodings: PLAIN, RLE_DICTIONARY
- Compression: UNCOMPRESSED, SNAPPY, ZSTD
- High-performance columnar reader
- SIMD optimizations (19% faster)
- Schema inference
- Projection pushdown
Target: v0.2.0
Timeline: Q2 2026
Rationale:
- Natural companion to Parquet
- De facto standard for Hive/Hadoop ecosystems
- Similar optimization opportunities (predicate pushdown, stripe-level skipping)
- Can reuse compression infrastructure
Deliverables:
- Metadata reader (schema, stripe statistics)
- Basic column decoder (primitives + strings)
- Compression support (reuse ZSTD/Snappy, add Zlib)
- Predicate pushdown support
Complexity: Medium-High
Estimated Effort: 8-12 weeks
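
The stripe-level skipping mentioned in the rationale can be illustrated with a minimal sketch. It assumes per-stripe min/max statistics for a single integer column are already available from the metadata reader; `StripeStats` and `stripes_to_read` are hypothetical names used only for illustration, not part of any existing API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StripeStats:
    """Hypothetical per-stripe min/max statistics for one integer column."""
    stripe_index: int
    min_value: int
    max_value: int


def stripes_to_read(stats: List[StripeStats], lo: int, hi: int) -> List[int]:
    """Return indices of stripes whose [min, max] range overlaps [lo, hi].

    A stripe can be skipped without decoding when its value range cannot
    contain any row matching the predicate lo <= value <= hi.
    """
    return [
        s.stripe_index
        for s in stats
        if s.max_value >= lo and s.min_value <= hi
    ]


# Illustrative statistics, not taken from a real file.
stats = [
    StripeStats(0, 1, 100),
    StripeStats(1, 101, 200),
    StripeStats(2, 201, 300),
]
selected = stripes_to_read(stats, 150, 250)
```

Here only the stripes whose ranges intersect the predicate are decoded; the first stripe is never read.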
Target: v0.3.0
Timeline: Q3 2026
Rationale:
- Universal data exchange format
- Excellent fit for rugo's SIMD expertise
- Performance gap vs existing readers is large
- Common first step in ETL pipelines
Deliverables:
- SIMD-optimized delimiter scanning (AVX2/SSE2)
- Schema inference with columnar output
- Projection by column index
- RFC 4180 compliance (quoted fields, escaping)
- Dialect detection (delimiter, quote char)
Complexity: Medium
Estimated Effort: 4-6 weeks
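
As a point of reference for the delimiter-scanning deliverable, the following scalar sketch shows the result a SIMD scanner would need to reproduce: the positions of delimiters that fall outside quoted fields. It is not the planned implementation. The function names are hypothetical, and the sketch assumes a single record with no embedded newlines.

```python
from typing import List


def field_boundaries(line: bytes, delimiter: int = ord(","),
                     quote: int = ord('"')) -> List[int]:
    """Return byte offsets of delimiters that are outside quoted fields.

    Toggling the quote state on every quote byte also handles the RFC 4180
    escape (a doubled quote inside a quoted field), because the two quotes
    toggle the state twice and leave it unchanged.
    """
    in_quotes = False
    positions = []
    for i, b in enumerate(line):
        if b == quote:
            in_quotes = not in_quotes
        elif b == delimiter and not in_quotes:
            positions.append(i)
    return positions


def split_fields(line: bytes) -> List[bytes]:
    """Split one record into fields and unescape quoted fields."""
    cuts = field_boundaries(line)
    starts = [0] + [c + 1 for c in cuts]
    ends = cuts + [len(line)]
    fields = []
    for start, end in zip(starts, ends):
        raw = line[start:end]
        if len(raw) >= 2 and raw[:1] == b'"' and raw[-1:] == b'"':
            # Strip the enclosing quotes and collapse doubled quotes.
            raw = raw[1:-1].replace(b'""', b'"')
        fields.append(raw)
    return fields


record = b'a,"b,c","say ""hi""",d'
boundaries = field_boundaries(record)
fields = split_fields(record)
```

A vectorized version would compute the same quote mask and delimiter mask over 16 or 32 bytes at a time; the scalar loop is useful as a correctness oracle in tests.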
Target: v0.4.0
Timeline: Q4 2026
Rationale:
- Industry standard for Kafka message schemas
- Growing use in streaming pipelines
- Unique opportunity: columnar extraction from row format
- Schema evolution support
Deliverables:
- Schema-driven binary decoder
- Batch decoding of rows into columns
- Projection pushdown (field skipping)
- Container format support (compression, sync markers)
Complexity: Medium
Estimated Effort: 5-7 weeks
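
The idea of schema-driven decoding with field skipping can be sketched as follows. This is a simplified illustration, not the planned decoder: it handles only the `long`, `string`, and `double` primitives, represents the schema as a hypothetical list of `(name, type)` pairs, and ignores the container format. The zigzag varint and length-prefixed string encodings follow the Avro binary encoding.

```python
import struct
from typing import Dict, List, Set, Tuple


def write_long(value: int) -> bytes:
    """Encode a 64-bit integer as an Avro zigzag varint (test helper)."""
    n = (value << 1) ^ (value >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def read_long(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a zigzag varint at pos; return (value, next position)."""
    shift = 0
    acc = 0
    while True:
        b = buf[pos]
        pos += 1
        acc |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos


def encode_row(row: tuple, schema: List[Tuple[str, str]]) -> bytes:
    """Encode one record in schema order (test helper)."""
    out = bytearray()
    for value, (_, typ) in zip(row, schema):
        if typ == "long":
            out += write_long(value)
        elif typ == "string":
            data = value.encode("utf-8")
            out += write_long(len(data)) + data
        elif typ == "double":
            out += struct.pack("<d", value)
    return bytes(out)


def decode_columns(buf: bytes, schema: List[Tuple[str, str]],
                   wanted: Set[str], row_count: int) -> Dict[str, list]:
    """Decode row-encoded records into per-column lists.

    Fields not in `wanted` are skipped: their bytes are stepped over
    without materializing a Python value.
    """
    columns: Dict[str, list] = {n: [] for n, _ in schema if n in wanted}
    pos = 0
    for _ in range(row_count):
        for name, typ in schema:
            keep = name in wanted
            if typ == "long":
                value, pos = read_long(buf, pos)
                if keep:
                    columns[name].append(value)
            elif typ == "string":
                length, pos = read_long(buf, pos)
                if keep:
                    columns[name].append(buf[pos:pos + length].decode("utf-8"))
                pos += length
            elif typ == "double":
                if keep:
                    columns[name].append(struct.unpack_from("<d", buf, pos)[0])
                pos += 8
            else:
                raise ValueError(f"unsupported type: {typ}")
    return columns


schema = [("id", "long"), ("name", "string"), ("score", "double")]
rows = [(1, "ann", 0.5), (-2, "bob", 1.5)]
payload = b"".join(encode_row(r, schema) for r in rows)
columns = decode_columns(payload, schema, {"id", "score"}, len(rows))
```

Because Avro records carry no field offsets, a skipped variable-length field still requires reading its length prefix; the saving comes from not decoding or copying the payload.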
Target: v0.5.0
Timeline: Q1 2027
Rationale:
- Growing adoption in Python ecosystem
- Lightweight alternative to PyArrow for metadata
- Zero-copy opportunities
- Native format for Polars
Deliverables:
- Metadata extraction (schema, statistics)
- Memory-mapped column access
- Compression support (LZ4, ZSTD)
- Selective column loading
Complexity: Medium-High
Estimated Effort: 6-8 weeks
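
The zero-copy opportunity behind memory-mapped column access can be shown with a small standard-library sketch. It does not parse the real file layout: it assumes a hypothetical file consisting of an 8-byte header followed by one uncompressed int64 column in native byte order, and the function names are illustrative only.

```python
import mmap
import os
import tempfile
from array import array
from typing import List


def write_column_file(path: str, header: bytes, values: List[int]) -> None:
    """Write a header followed by a raw int64 column (test helper)."""
    with open(path, "wb") as f:
        f.write(header)
        f.write(array("q", values).tobytes())


def sum_mapped_column(path: str, offset: int, count: int) -> int:
    """Sum `count` int64 values starting at `offset` without copying them.

    The memoryview is a typed window onto the mapped pages. Every view is
    released before the mapping closes, otherwise closing would fail.
    """
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm, \
         memoryview(mm) as raw, \
         raw[offset:offset + count * 8] as window, \
         window.cast("q") as column:
        return sum(column)


header = b"COLFILE1"  # hypothetical 8-byte header, not a real magic number
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")
    write_column_file(path, header, [10, 20, 30, 40])
    total = sum_mapped_column(path, len(header), 4)
    tail = sum_mapped_column(path, len(header) + 16, 2)
```

Selective column loading follows the same pattern: only the pages backing the requested byte range are touched, so unread columns cost no I/O beyond the metadata lookup.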
Target: TBD
Condition: User demand
Rationale:
- Binary format, more compact than JSON
- Used in messaging systems
- Lower priority due to limited analytics adoption
Estimated Effort: 3-4 weeks
Formats are selected based on:
- Market Adoption - Widely used in data analytics/engineering
- Performance Opportunity - General readers leave room for optimization
- Columnar Fit - Format amenable to columnar extraction
- Ecosystem Gap - No lightweight/fast alternative exists
- Code Reuse - Can leverage existing infrastructure
Too complex, well-served by openpyxl/xlrd
Verbose, declining usage in data pipelines
Scientific computing focus, not data analytics
RPC-focused, requires external schemas
Legacy, better handled by database connectors
For each new format:
- Start with Metadata - Like Parquet, metadata-only reader first
- Basic Decoding - Primitives + strings for 80% of use cases
- Leverage Infrastructure - Reuse compression, SIMD, memory handling
- Maintain Philosophy - No runtime dependencies (beyond stdlib)
- Incremental Rollout - Experimental → Beta → Stable
- ✅ Snappy (Parquet, future ORC)
- ✅ ZSTD (Parquet, future ORC, Arrow)
- ⚠️ Need: Zlib (ORC)
- ⚠️ Need: LZ4 (Arrow, ORC)
- ✅ Thrift (Parquet metadata)
- ⚠️ Need: Protocol Buffers (ORC metadata)
- ⚠️ Need: FlatBuffers (Arrow metadata)
- ✅ AVX2/SSE2 text scanning (JSONL)
- ✅ Can reuse for CSV, Avro
- ✅ New opportunities: Binary parsing
- 2-5x faster than pandas/PyArrow for target operations
- Maintain 15-20% SIMD optimization benefit
- Sub-second metadata extraction for files <1GB
- Metadata: 100% of common fields
- Types: int32/64, float32/64, string, boolean (minimum)
- Compression: SNAPPY, ZSTD (minimum)
- Advanced: Nested types (stretch goal)
- PyPI downloads growth
- User feedback on format support
- Integration with Orso and other ecosystems
This roadmap is not final. We welcome feedback on:
- Format priorities - Which formats would benefit you most?
- Use cases - What specific optimizations matter for your workflows?
- Performance targets - What speedups would make format support valuable?
Please open GitHub issues with feature requests or comments on the roadmap.
- ✅ Completed - Fully supported
- 🎯 Planned - Committed for future release
- 🔮 Proposed - Under consideration
- ⏳ Conditional - Depends on user demand
- ❌ Not Planned - Out of scope
Last Updated: 2025-10-23