
Storing ndarrays into Parquet via uber/petastorm?


Is it possible to store N-dimensional arrays in Parquet via uber/petastorm?


Solution

  • Yes. Petastorm provides a custom layer of codecs and a schema extension on top of the standard Apache Parquet format. N-dimensional arrays/tensors are serialized into binary blob fields. From the user's perspective, these look like native types, depending on the environment you work in: numpy arrays in pure Python/PySpark, tf.Tensor in TensorFlow, or torch.Tensor in PyTorch.

    There are some easy-to-follow examples here: https://github.com/uber/petastorm/tree/master/examples/hello_world/petastorm_dataset
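The serialization idea above can be sketched in plain NumPy. This is not petastorm's API, just a hedged illustration of the round trip its ndarray codec performs: save the array to an in-memory buffer, store the resulting bytes in a Parquet binary column, and load them back on read. The function names `encode`/`decode` are hypothetical.

```python
import io
import numpy as np

def encode(arr: np.ndarray) -> bytes:
    """Serialize an ndarray into a binary blob, the kind of payload
    that ends up in a Parquet binary field."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()

def decode(blob: bytes) -> np.ndarray:
    """Restore the ndarray (dtype and shape included) from the blob."""
    return np.load(io.BytesIO(blob))

# Round-trip a 3-D float32 tensor
tensor = np.arange(4 * 8 * 3, dtype=np.float32).reshape(4, 8, 3)
restored = decode(encode(tensor))
assert np.array_equal(tensor, restored)
assert restored.dtype == np.float32 and restored.shape == (4, 8, 3)
```

In petastorm itself you would instead declare the field in a Unischema with an ndarray codec, and the library handles this encoding transparently when writing and reading the dataset.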