nssalian/parx

PARX

Early Development: This project is in active development. Format and APIs may change.

Persistent metadata caching for Parquet files.

What It Does

PARX caches Parquet metadata in sidecar files (.parx) to eliminate repeated metadata fetches.

The problem: Parquet stores metadata at the end of files. In remote and object-store read paths, readers often need multiple metadata-oriented requests to discover file size, fetch the tail, and load the footer. Repeated access amplifies that overhead.
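To make the round trips concrete, here is a minimal sketch of the tail-decoding step a plain Parquet reader performs. It relies only on the standard Parquet layout (a 4-byte little-endian footer length followed by the `PAR1` magic at the very end of the file); the function name `footer_range` is hypothetical and not part of PARX.

```rust
/// The Parquet tail: 4-byte little-endian footer length + 4-byte magic.
const TAIL_LEN: usize = 8;
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Given the last 8 bytes of a Parquet file and the total file size,
/// return the byte range (start, end) of the footer, or None if the
/// tail is malformed. (Hypothetical helper, for illustration only.)
fn footer_range(tail: &[u8], file_size: u64) -> Option<(u64, u64)> {
    if tail.len() != TAIL_LEN || tail[4..8] != PARQUET_MAGIC[..] {
        return None;
    }
    let footer_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as u64;
    // The footer ends where the 8-byte tail begins.
    let end = file_size.checked_sub(TAIL_LEN as u64)?;
    let start = end.checked_sub(footer_len)?;
    Some((start, end))
}
```

A reader needs the file size before it can request the tail, and needs the tail before it knows which range holds the footer, which is where the multiple requests per file come from.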

The solution: Cache metadata once in a .parx sidecar. Readers fetch the sidecar directly instead of repeatedly re-reading footer metadata from the .parquet file.

file.parquet (2.7 MB)
file.parquet.parx (282 KB)
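The naming convention shown above (append `.parx` to the full Parquet file name) can be sketched as a small path helper. The function names are hypothetical; the bundle file name `_parx_bundle.parx` is the one reported in the CLI Usage section.

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

/// Derive the sidecar path by appending ".parx" to the whole file name,
/// so "file.parquet" becomes "file.parquet.parx". (Hypothetical helper.)
fn sidecar_path(parquet: &Path) -> PathBuf {
    let mut name: OsString = parquet.as_os_str().to_owned();
    name.push(".parx");
    PathBuf::from(name)
}

/// Derive the bundle path for a directory of Parquet files.
fn bundle_path(dir: &Path) -> PathBuf {
    dir.join("_parx_bundle.parx")
}
```

Appending to the full name, rather than replacing the extension, keeps the sidecar sorted next to its source file in directory listings.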

Format

Single-file format:

┌──────────────────────────────────────────┐
│ Header (16 bytes)                        │
│  - Magic: "PARX"                         │
│  - Version, Flags                        │
├──────────────────────────────────────────┤
│ Footer Payload (variable, raw/compressed)│
│  - Raw Parquet footer bytes              │
├──────────────────────────────────────────┤
│ Page Index Payload (optional)            │
│  - ColumnIndex + OffsetIndex             │
├──────────────────────────────────────────┤
│ Manifest (Protobuf)                      │
│  - Offsets, lengths, checksums           │
│  - Source file size                      │
├──────────────────────────────────────────┤
│ Trailer (12 bytes)                       │
│  - Manifest length, CRC32C               │
│  - Magic: "PARX"                         │
└──────────────────────────────────────────┘
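As a rough illustration of how a reader might locate the manifest, the sketch below parses the 12-byte trailer from the end of a sidecar. The field order follows the diagram, but the field widths (4 bytes each) and little-endian encoding are assumptions made for this example; FORMAT_SPEC.md is the authoritative layout. `Trailer` and `parse_trailer` are hypothetical names, not the `parx_rs` API.

```rust
/// Decoded trailer fields (assumed layout: u32 length, u32 CRC32C, magic).
#[derive(Debug, PartialEq)]
struct Trailer {
    manifest_len: u32,
    manifest_crc32c: u32,
}

/// Parse the last 12 bytes of a sidecar. Returns None if the buffer is
/// too short or the trailing magic is not "PARX".
/// ASSUMPTION: all integers are little-endian and 4 bytes wide.
fn parse_trailer(file: &[u8]) -> Option<Trailer> {
    const TRAILER_LEN: usize = 12;
    let start = file.len().checked_sub(TRAILER_LEN)?;
    let t = &file[start..];
    if t[8..12] != b"PARX"[..] {
        return None;
    }
    Some(Trailer {
        manifest_len: u32::from_le_bytes([t[0], t[1], t[2], t[3]]),
        manifest_crc32c: u32::from_le_bytes([t[4], t[5], t[6], t[7]]),
    })
}
```

Under these assumptions the manifest would occupy the `manifest_len` bytes immediately before the trailer, and its offsets and lengths would then point at the footer and page-index payloads.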

Bundle format (for directories):

┌──────────────────────────────────────────┐
│ Bundle Header (24 bytes)                 │
│  - Magic: "PRXB"                         │
│  - Version, Flags                        │
│  - Entry count                           │
├──────────────────────────────────────────┤
│ Entry 0: Footer (+ optional page indexes)│
├──────────────────────────────────────────┤
│ Entry 1: Footer (+ optional page indexes)│
├──────────────────────────────────────────┤
│ ... (N entries)                          │
├──────────────────────────────────────────┤
│ Bundle Manifest (Protobuf)               │
│  - Path→Entry mapping                    │
├──────────────────────────────────────────┤
│ Trailer (12 bytes)                       │
│  - Manifest length, CRC32C               │
│  - Magic: "PRXB"                         │
└──────────────────────────────────────────┘
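Both trailers carry a CRC32C value. For readers unfamiliar with it, here is a minimal bitwise implementation of CRC32C (the Castagnoli polynomial), written for clarity rather than speed. Which bytes the PARX checksum covers is not stated here, so treat this only as a sketch of the checksum algorithm itself; production code would normally use a hardware-accelerated crate.

```rust
/// Bitwise CRC32C (Castagnoli), reflected form.
/// Slow but dependency-free; intended as an illustration.
fn crc32c(data: &[u8]) -> u32 {
    const POLY: u32 = 0x82F6_3B78; // reflected Castagnoli polynomial
    let mut crc = !0u32; // initial value: all ones
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, otherwise zero
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (POLY & mask);
        }
    }
    !crc // final XOR with all ones
}
```

The standard check value for this algorithm is `0xE3069283` for the ASCII input `123456789`, which is a convenient way to validate any implementation.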

See FORMAT_SPEC.md for the detailed byte-level layout. Bundle entries can optionally include page-index payloads, subject to policy-controlled size caps.

Building

# Core library
cd implementations/rust/parx
cargo build --release

# Java library
cd implementations/java/parx
./gradlew build

# CLI tool (install to ~/.cargo/bin)
cd implementations/rust/parx-cli
cargo install --path . --locked

# Benchmarks
cd benchmarks/parx_benchmarks
make all

If parx is not found after install, ensure ~/.cargo/bin is on your PATH.

CLI Usage

# Build sidecar for single file
parx build file.parquet

# Verify sidecar
parx verify file.parquet.parx

# Inspect contents
parx inspect file.parquet.parx

# Bundle directory
parx bundle build /data/events/
# Creates: /data/events/_parx_bundle.parx

# Bundle directory with page indexes (optional, capped)
parx bundle build /data/events/ \
  --include-page-indexes \
  --max-page-index-bytes-per-file 262144 \
  --max-total-page-index-bytes 16777216

# Extract bundle
parx bundle extract /data/events/_parx_bundle.parx --output /output/
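The two cap values in the page-index example are 256 KiB per file (262144 bytes) and 16 MiB in total (16777216 bytes). How the CLI applies them is not described here, so the sketch below shows just one plausible policy: include a file's page index only if it fits under the per-file cap and the running total stays under the overall cap. `PageIndexCaps` and `select_page_indexes` are hypothetical names for illustration.

```rust
/// Size limits for page-index payloads in a bundle (hypothetical type).
struct PageIndexCaps {
    max_per_file: u64,
    max_total: u64,
}

/// For each file's page-index size, decide whether to include it.
/// ASSUMPTION: oversized indexes are skipped rather than truncated or
/// treated as errors; the real CLI may behave differently.
fn select_page_indexes(sizes: &[u64], caps: &PageIndexCaps) -> Vec<bool> {
    let mut total = 0u64;
    sizes
        .iter()
        .map(|&size| {
            let fits = size <= caps.max_per_file
                && total.saturating_add(size) <= caps.max_total;
            if fits {
                total += size;
            }
            fits
        })
        .collect()
}
```

With the caps from the example command, a 300,000-byte page index would be left out while a 262,144-byte one would still be included.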

Library Usage

use parx_rs::{ParxReader, ParxWriter};

// Write: build from Parquet file directly
let mut writer = ParxWriter::from_parquet_file("file.parquet")?;
let parx_bytes = writer.finish();
std::fs::write("file.parquet.parx", parx_bytes)?;

// Read: load cached footer from .parx sidecar
let parx_data = std::fs::read("file.parquet.parx")?;
let reader = ParxReader::open(&parx_data)?;
let footer = reader.footer_bytes(); // Raw Parquet footer, ready to use

Java Implementation

There is also a Java implementation in implementations/java/parx, which includes:

  • single-file sidecar read/write
  • bundle read/write
  • compression and validation APIs
  • header and bundle metadata accessors

Useful docs:

  • docs/JAVA_CHANGES.md
  • docs/RUST_CHANGES.md

Benchmarks

Results from local runs across four schema types (simple, medium, wide, nested):

Arrow async vs PARX:

  • Requests: 3.0 → 1.0 per file (66.7% reduction)
  • Latency: ~237 µs → ~74 µs (3.22x faster)
  • Bytes: ~24 KB → ~25 KB (2.2% overhead)

Note: this benchmark measures the metadata read path with prebuilt .parx sidecars. One-time sidecar creation is excluded.
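The percentages above follow from simple ratios, sketched here with hypothetical helper names. Note that the rounded latencies (237 µs and 74 µs) give roughly 3.20x; the reported 3.22x presumably comes from the unrounded measurements.

```rust
/// Percentage reduction going from `before` to `after`.
fn reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Speedup factor: how many times faster `after` is than `before`.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}
```

For example, `reduction_pct(3.0, 1.0)` gives the 66.7% request reduction, and `speedup(237.0, 74.0)` gives the approximate latency improvement.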

Benchmark environment:

  • Apple MacBook Pro (MacBookPro18,3)
  • Apple M1 Pro, 10 CPU cores
  • 16 GB memory
  • macOS 14.6.1

Run benchmarks:

cd benchmarks/parx_benchmarks
make arrow-vs-parx  # Arrow vs PARX comparison
make prefetch       # Prefetch hint testing

When to Use

Use it:

  • Cloud storage (S3/GCS/Azure)
  • Multiple processes reading the same files
  • Immutable or versioned files
  • Parquet V2 with page indexes (including bundle mode with policy caps)

Skip it:

  • Single-process work (in-memory cache is fine)
  • Local SSD (minimal benefit)
  • Delta/Iceberg/Hudi tables (built-in metadata)
  • Frequently updated files
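The advice about immutable versus frequently updated files comes down to staleness: a sidecar is only useful while it still describes the current Parquet file. The manifest records the source file size, so one cheap check is to compare that value against the file on disk. This is a minimal sketch with a hypothetical function name; whether and how the PARX libraries perform this check is not specified here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Return true if the size recorded in the sidecar manifest still matches
/// the Parquet file on disk. (Hypothetical helper; `recorded_size` is
/// assumed to have been read from the manifest beforehand.)
fn sidecar_matches_source(recorded_size: u64, parquet: &Path) -> io::Result<bool> {
    let actual_size = fs::metadata(parquet)?.len();
    Ok(actual_size == recorded_size)
}
```

A size comparison cannot detect a rewrite that happens to produce a file of the same length, which is one reason immutable or versioned files are the safer fit.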

Project Structure

parx/
├── implementations/java/
│   └── parx/              # Java library
├── implementations/rust/
│   ├── parx/              # Core library
│   └── parx-cli/          # CLI tool
├── benchmarks/
│   └── parx_benchmarks/   # Performance tests
├── spec/
│   └── proto/             # Protobuf schema
└── FORMAT_SPEC.md         # Format specification

Testing

# Unit tests
cargo test

# Java library tests
cd implementations/java/parx
./gradlew test

# Integration tests
cd implementations/rust/parx-cli
cargo test --test cli_integration

License

Apache 2.0

About

PARX stores Parquet footer and page-index metadata in sidecars for faster metadata access across Rust and Java
