0x26res/ptars

ptars


Repository | Python Documentation | Python Installation | PyPI | Rust Crate | Rust Documentation

Fast conversion between Protocol Buffers and Apache Arrow, using Rust, with Python bindings.

ptars converts directly between the protobuf wire format and Arrow columnar arrays without creating intermediate message objects. Serialized protobuf bytes are parsed straight into Arrow builders, and Arrow arrays are encoded directly to the protobuf wire format, which avoids the overhead of DynamicMessage and of any per-row object allocation.

Example

Take a protobuf:

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
}

Then convert the serialized messages directly to a pyarrow.RecordBatch:

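from ptars import HandlerPool

# SearchRequest is the Python class that protoc generates for the message
# above; import it from your own generated *_pb2 module.

messages = [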
    SearchRequest(
        query="protobuf to arrow",
        page_number=0,
        result_per_page=10,
    ),
    SearchRequest(
        query="protobuf to arrow",
        page_number=1,
        result_per_page=10,
    ),
]
payloads = [message.SerializeToString() for message in messages]

pool = HandlerPool([SearchRequest.DESCRIPTOR.file])
handler = pool.get_for_message(SearchRequest.DESCRIPTOR)
record_batch = handler.list_to_record_batch(payloads)

The resulting record batch contains:

query              page_number  result_per_page
protobuf to arrow  0            10
protobuf to arrow  1            10
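
To make "parsed straight into Arrow builders" more concrete, here is a minimal pure-Python sketch of the idea. It is not ptars's implementation (which is in Rust and appends to Arrow builders). It only illustrates reading the wire format of the SearchRequest message above and appending each value to a per-column list, with no message object in between. The names read_varint, append_message and FIELDS are hypothetical, and the payloads are encoded by hand so the snippet has no dependencies.

```python
# Hypothetical field table for the SearchRequest message shown above.
FIELDS = {1: "query", 2: "page_number", 3: "result_per_page"}


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one base-128 varint at pos; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def append_message(buf: bytes, columns: dict[str, list]) -> None:
    """Parse one serialized SearchRequest and append one value per column."""
    # proto3 defaults, used when a field is absent from the wire.
    row = {"query": "", "page_number": 0, "result_per_page": 0}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint, used by the int32 fields
            value, pos = read_varint(buf, pos)
            if value >= 1 << 63:  # negative int32 is sign-extended to 64 bits
                value -= 1 << 64
        elif wire_type == 2:  # length-delimited, used by the string field
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        name = FIELDS.get(field_number)  # unknown fields are skipped
        if name == "query":
            row[name] = value.decode("utf-8")
        elif name is not None:
            row[name] = value
    for name, value in row.items():
        columns[name].append(value)


# Hand-encoded payloads for the two example messages. page_number=0 is the
# proto3 default, so it does not appear in the first payload.
query = b"protobuf to arrow"
payloads = [
    bytes([0x0A, len(query)]) + query + bytes([0x18, 10]),
    bytes([0x0A, len(query)]) + query + bytes([0x10, 1, 0x18, 10]),
]

columns = {name: [] for name in FIELDS.values()}
for payload in payloads:
    append_message(payload, columns)
```

Each payload is walked once and its values land directly in column storage, which is the property the description above attributes to ptars.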

You can also convert a pyarrow.RecordBatch back to serialized protobuf messages:

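import pyarrow as pa

# Encode each row of the record batch as one serialized SearchRequest.
array: pa.BinaryArray = handler.record_batch_to_array(record_batch)
# Parse the bytes back into message objects to check the round trip.
messages_back: list[SearchRequest] = [
    SearchRequest.FromString(s.as_py()) for s in array
]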
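
The reverse direction can be pictured the same way. The following is a rough pure-Python sketch, not ptars's code, of encoding column values row by row into protobuf wire format without building message objects. The helpers write_varint and encode_row are hypothetical, and the sketch assumes proto3 behaviour where fields holding their default value are left off the wire.

```python
def write_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_row(query: str, page_number: int, result_per_page: int) -> bytes:
    """Encode one row as a serialized SearchRequest (fields 1, 2 and 3)."""
    out = bytearray()
    if query:  # field 1, wire type 2 (length-delimited)
        data = query.encode("utf-8")
        out += write_varint((1 << 3) | 2) + write_varint(len(data)) + data
    for field_number, value in ((2, page_number), (3, result_per_page)):
        if value:  # defaults (0) are omitted from the wire
            out += write_varint((field_number << 3) | 0)
            # Negative int32 values are sign-extended to 64 bits.
            out += write_varint(value & 0xFFFFFFFFFFFFFFFF)
    return bytes(out)


# Column-oriented input, standing in for the Arrow arrays of a record batch.
columns = {
    "query": ["protobuf to arrow", "protobuf to arrow"],
    "page_number": [0, 1],
    "result_per_page": [10, 10],
}
payloads = [
    encode_row(*row)
    for row in zip(
        columns["query"], columns["page_number"], columns["result_per_page"]
    )
]
```

Under those assumptions, the resulting bytes should be parseable with SearchRequest.FromString, just like the elements of the array returned by record_batch_to_array.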

Configuration

Customize Arrow type mappings with PtarsConfig:

from ptars import HandlerPool, PtarsConfig

config = PtarsConfig(
    timestamp_unit="us",  # microseconds instead of nanoseconds
    timestamp_tz="America/New_York",
)

pool = HandlerPool([SearchRequest.DESCRIPTOR.file], config=config)
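
As a rough illustration of what the timestamp unit changes, the sketch below converts the (seconds, nanos) pair of a google.protobuf.Timestamp into a single integer in a given unit, which is how Arrow stores timestamp values. The function timestamp_to_unit is hypothetical and not part of ptars. The unit names are Arrow's, and which of them PtarsConfig accepts besides "us" is not stated here. This sketch truncates sub-unit precision; whether ptars truncates or rounds is an assumption left open.

```python
NANOS_PER_SECOND = 1_000_000_000
# Arrow's timestamp units and how many of each fit in one second.
UNITS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": NANOS_PER_SECOND}


def timestamp_to_unit(seconds: int, nanos: int, unit: str) -> int:
    """Convert a Timestamp's (seconds, nanos) to an integer count of `unit`."""
    per_second = UNITS_PER_SECOND[unit]
    # nanos is always in [0, 999_999_999], so floor division truncates.
    return seconds * per_second + nanos // (NANOS_PER_SECOND // per_second)


as_nanos = timestamp_to_unit(1_700_000_000, 123_456_789, "ns")
as_micros = timestamp_to_unit(1_700_000_000, 123_456_789, "us")
```

With "us" the last three digits of the nanosecond value are dropped, in exchange for a much wider representable date range in a 64-bit integer. For Arrow timestamps that carry a time zone, the stored integer is relative to UTC and the zone name is metadata attached to the type.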

Benchmark against protarrow

Ptars is a Rust implementation of protarrow, which is implemented in plain Python. By encoding and decoding directly between protobuf wire format and Arrow arrays, ptars is:

  • 7x+ faster when converting from proto to Arrow.
  • 30x+ faster when converting from Arrow to proto.

---- benchmark 'to_arrow': 2 tests ----
Name (time in us)        Mean
---------------------------------------
ptars_to_arrow          659 (1.0)
protarrow_to_arrow    5,037 (7.65)
---------------------------------------

---- benchmark 'to_proto': 2 tests -----
Name (time in us)         Mean
----------------------------------------
ptars_to_proto           397 (1.0)
protarrow_to_proto    12,534 (31.61)
----------------------------------------
