Tags: python-3.x, numpy, parquet, awkward-array

Awkward Array: How to get numpy array after storing as Parquet (not BitMasked)?


I want to store 2D arrays of different lengths as an Awkward Array, save them to Parquet, and later access them again. The problem is that, after loading from Parquet, the format is BitMaskedArray and access performance is a bit slow, as the following code demonstrates:

import numpy as np
import awkward as awk

# big to feel performance (imitating big audio file); 2D
np_arr0 = np.arange(20000000, dtype=np.float32).reshape(2, -1)
print(np_arr0.shape)
# (2, 10000000)
# different size
np_arr1 = np.arange(20000000, 36000000, dtype=np.float32).reshape(2, -1)
print(np_arr1.shape)
# (2, 8000000)

# slow; turn into AwkwardArray
awk_arr = awk.fromiter([np_arr0, np_arr1])

# fast; returns np.ndarray
awk_arr[0][0]

# store and load from parquet
awk.toparquet("sample.parquet", awk_arr)
pq_array = awk.fromparquet("sample.parquet")

# kinda slow; return BitMaskedArray
pq_array[0][0]

If we inspect the return, we see:

pq_array[0][0].layout
#  layout 
# [    ()] BitMaskedArray(mask=layout[0], content=layout[1], maskedwhen=False, lsborder=True)
# [     0]   ndarray(shape=1250000, dtype=dtype('uint8'))
# [     1]   ndarray(shape=10000000, dtype=dtype('float32'))

# trying to access only the float32 array at [1]
pq_array[0][0][1]
# expected
# array([0.000000e+00, 1.000000e+00, 2.000000e+00, ..., 9.999997e+06, 9.999998e+06, 9.999999e+06], dtype=float32)

# reality
# 1.0
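(Indexing with [1] selects element 1 of the masked array, i.e. the value 1.0, not the layout node.) The mask in the layout above is a packed validity bitmap: 1250000 uint8 bytes holding one bit per each of the 10000000 float32 values, where maskedwhen=False, lsborder=True means a set bit marks a valid entry, least significant bit first. A minimal sketch of how such a bitmap unpacks, using plain NumPy rather than the awkward API:

```python
import numpy as np

# A Parquet/Arrow-style validity bitmap: one bit per element, packed 8 per byte.
# maskedwhen=False, lsborder=True: a set bit means "valid", least significant bit first.
content = np.arange(10, dtype=np.float32)
bitmap = np.packbits(np.ones(len(content), dtype=np.uint8), bitorder="little")

# Unpacking recovers one boolean per element; all True here, i.e. nothing is null.
valid = np.unpackbits(bitmap, count=len(content), bitorder="little").astype(bool)
```

This is why the mask ndarray has shape 1250000: it is 10000000 / 8 bytes, one validity bit per value.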

Question

How can I load AwkwardArray from Parquet and quickly access the numpy values?

Info from README (GitHub)

awkward.fromparquet is lazy-loading the Parquet file.

Good, that's what will help when doing e.g. pq_array[0][0][:1000]

The next layer of new structure is that the jagged array is bit-masked. Even though none of the values are nullable, this is an artifact of the way Parquet formats columnar data.

I guess there is no way around this. However, is this the reason why loading is somewhat slow? Can I still access the data directly as a numpy.ndarray, without the bitmask?

Additional attempt

Loading it with Arrow, then Awkward:

import pyarrow as pa
import pyarrow.parquet as pq

# Parquet as Arrow
pa_array = pq.read_table("sample.parquet")

# returns table instead of JaggedArray
awk.fromarrow(pa_array)
# <Table [<Row 0> <Row 1>] at 0x7fd92c83aa90>

Solution

  • In both Arrow and Parquet, all data is nullable, so Arrow/Parquet writers are free to throw in bitmasks wherever they want to. When reading the data back, Awkward has to treat those bitmasks as meaningful (mapping them to awkward.BitMaskedArray), but they might be all valid, particularly if you know that you didn't set any values to null.

    If you're willing to ignore the bitmask, you can reach behind it by calling

    pq_array[0][0].content
    

    As for the slowness, I can say that

    import awkward as ak
    
    # slow; turn into AwkwardArray
    awk_arr = ak.fromiter([np_arr0, np_arr1])
    

    is going to be slow because ak.fromiter is one of the few functions that is implemented with a Python for loop, and iterating over 10 million values in a NumPy array with a Python for loop is going to be painful. You can build the same thing manually with

    >>> ak_arr0 = ak.JaggedArray.fromcounts([np_arr0.shape[1], np_arr0.shape[1]],
    ...                                     np_arr0.reshape(-1))
    >>> ak_arr1 = ak.JaggedArray.fromcounts([np_arr1.shape[1], np_arr1.shape[1]],
    ...                                     np_arr1.reshape(-1))
    >>> ak_arr = ak.JaggedArray.fromcounts([len(ak_arr0), len(ak_arr1)],
    ...                                    ak.concatenate([ak_arr0, ak_arr1]))
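
    For intuition, fromcounts just pairs a flat content buffer with per-row counts (internally turned into offsets). A rough sketch of that bookkeeping in plain NumPy, not the awkward API:

```python
import numpy as np

# What JaggedArray.fromcounts amounts to, conceptually: a flat content buffer
# plus per-row counts, converted to offsets for cheap per-row slicing.
counts = np.array([3, 2])
content = np.array([0.0, 1.0, 2.0, 10.0, 11.0], dtype=np.float32)

offsets = np.concatenate([[0], np.cumsum(counts)])   # row i is content[offsets[i]:offsets[i+1]]
rows = [content[offsets[i]:offsets[i + 1]] for i in range(len(counts))]
```

    No element-by-element Python loop ever touches the big buffers; only the counts list is Python-sized, which is why this route is so much faster than fromiter.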
    

    As for Parquet being slow, I can't say why: it could be related to page size or row group size. Since Parquet is a "medium weight" file format (between "heavyweights" like HDF5 and "lightweights" like npy/npz), it has a few tunable parameters (not a lot).

    You might also want to consider

    ak.save("file.awkd", ak_arr)
    ak_arr2 = ak.load("file.awkd")
    

    which is really just the npy/npz format with JSON metadata to map Awkward arrays to and from flat NumPy arrays. For this sample, the file.awkd is 138 MB.
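
    For comparison, here is a rough sketch of what that format does, using plain NumPy's savez (writing to an in-memory buffer here; the key names are illustrative): persist the flat content plus the counts, then rebuild the row views on load.

```python
import io
import numpy as np

# Sketch of the idea behind .awkd: store flat NumPy buffers plus enough
# metadata (here just per-row counts) to rebuild the jagged structure on load.
buf = io.BytesIO()  # in-memory stand-in for a file on disk
np.savez(buf,
         counts=np.array([3, 2]),
         content=np.array([0.0, 1.0, 2.0, 10.0, 11.0], dtype=np.float32))

buf.seek(0)
with np.load(buf) as data:
    offsets = np.concatenate([[0], np.cumsum(data["counts"])])
    rows = [data["content"][offsets[i]:offsets[i + 1]]
            for i in range(len(data["counts"]))]
```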