Tags: parquet, pyarrow, apache-arrow

HTTP request with parquet and pyarrow


I would like to use pyarrow to read/query parquet data from a REST server. At the moment I'm chunking the data, converting it to pandas, dumping to JSON, and streaming the chunks, like so:

import json

import pyarrow.parquet as pq

p = pq.ParquetDataset('/path/to/data.parquet', filters=filters, use_legacy_dataset=False)
batches = p._dataset.to_batches(filter=p._filter_expression)  # note: private attributes
chunks = (json.dumps(b.to_pandas().values.tolist()) for b in batches)

This is effectively the same work as

import orjson
import pandas as pd
import pyarrow.parquet as pq

ds = pq.ParquetDataset('/path/to/data.parquet',
                       use_legacy_dataset=False,
                       filters=filters)
df = ds.read().to_pandas()
data = pd.DataFrame(orjson.loads(orjson.dumps(df.values.tolist())))

without the network I/O. It's about 50x slower than just reading into pandas directly:

df = ds.read().to_pandas()

Is there a way to serialize the parquet dataset to some binary string that I can send over HTTP and parse on the client side?


Solution

  • You can send your data using the Arrow in-memory columnar (IPC stream) format. It will be much more efficient and compact than JSON, but bear in mind it will be binary data (which, unlike JSON, is not human readable).

    See the doc for a full example.

    In your case you want to do something like this:

    import pyarrow as pa
    import pyarrow.parquet as pq

    ds = pq.ParquetDataset('/path/to/data.parquet',
                           use_legacy_dataset=False,
                           filters=filters)
    table = ds.read()  # pa.Table

    # Write the data as an Arrow IPC stream into an in-memory buffer:
    batches = table.to_batches()
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, table.schema)
    for batch in batches:
        writer.write(batch)
    writer.close()
    buf = sink.getvalue()  # pa.Buffer containing the serialized stream

    # Read the data back (this is what the client side would do):
    reader = pa.ipc.open_stream(buf)
    read_batches = list(reader)
    read_table = pa.Table.from_batches(read_batches)

    read_table.to_pandas()
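
    Note that pa.ipc.open_stream(buf).read_all() is a shortcut for collecting all the batches back into a single pa.Table.

    To actually move the bytes over HTTP, serve the buffer as the response body and open the stream on the client. Here is a minimal sketch assuming a Flask server and the requests client; the /data route, the port, and the hard-coded filters are illustrative, and any framework that can send and receive raw bytes works the same way:

    # server.py -- hypothetical Flask app serving the Arrow IPC stream
    from flask import Flask, Response
    import pyarrow as pa
    import pyarrow.parquet as pq

    app = Flask(__name__)

    @app.route('/data')
    def data():
        # filters could be built from request query parameters;
        # None reads the whole dataset
        ds = pq.ParquetDataset('/path/to/data.parquet',
                               use_legacy_dataset=False,
                               filters=None)
        table = ds.read()
        sink = pa.BufferOutputStream()
        with pa.ipc.new_stream(sink, table.schema) as writer:
            for batch in table.to_batches():
                writer.write(batch)
        # application/vnd.apache.arrow.stream is the registered media type
        # for the Arrow streaming format
        return Response(sink.getvalue().to_pybytes(),
                        mimetype='application/vnd.apache.arrow.stream')

    # client.py -- parse the response body straight back into a table
    import requests
    import pyarrow as pa

    resp = requests.get('http://localhost:5000/data')
    table = pa.ipc.open_stream(resp.content).read_all()
    df = table.to_pandas()

    Because the IPC stream carries the schema, the client needs no out-of-band type information, unlike the JSON round-trip in the question.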