
Best way to save a dict of awkward1 arrays?


Back in awkward v0 it was possible to do:

import awkward
dog = awkward.fromiter([[1., 2.], [5.]])
cat = awkward.fromiter([[4], [3]])
dict_of_arrays = {'dog': dog, 'cat': cat}
awkward.save("pets.awkd", dict_of_arrays)

Then we could lazy load the array

reloaded_data = awkward.load("pets.awkd")
# no data in ram
double_dog = reloaded_data["dog"]*2
# dog is in ram but not cat

In short, we have a dataset consisting of 'dog' and 'cat' parts. The whole dataset saves to one file on disk. Even if I didn't have any documentation, it would be obvious which data is dog and which is cat. Dog and cat load as awkward arrays. I can load the data and work with just one part without the other part ending up in RAM.

I'm looking for the best way to do this in awkward v1. The requirements I would like to meet are:

  • The data consists of multiple named parts, with irregular shapes.
  • All items in one named part have the same data type, different parts may have different data types.
  • Some sort of lazy loading needs to be possible, working on bits of the data as awkward1 arrays without the whole thing.
  • Ideally, the names of the parts are unambiguously associated with the data for each part. Dict structure is good for this, but other things could work.
  • Ideally, the whole dataset saves and loads from one file without speed penalty.
  • Ideally, when the array is loaded it has the right type, so in the example dog is a float array and cat is an int array.

I had a look at awkward1.to_parquet, and while it looks good, it seems to be just for saving one array. This doesn't fit well with the need to hold multiple data types, and I'm not sure how I'd record the column names. I suppose I could convert back to awkward v0 and save that way, but I'm not sure how that would play with lazy loading. It might be that I need to write a wrapper to do these things, which would be totally fine, but I wanted to check first if there is something built in that I should know about.

Edit: the answer given works great. For completeness, I wanted to leave an example of using it:

In [1]: import awkward1 as ak

In [2]: dog = ak.from_iter([[1., 2.], [5.]])
   ...: cat = ak.from_iter([[4], [3]])

In [3]: ak.zip?

In [4]: pets = ak.zip({"dog": dog, "cat": cat}, depth_limit=1)

In [5]: pets.dog
Out[5]: <Array [[1, 2], [5]] type='2 * var * float64'>

In [6]: pets.cat
Out[6]: <Array [[4], [3]] type='2 * var * int64'>


In [7]: ak.to_parquet(pets, "pets.parquet")



Solution

  • What Awkward v0 did with awkward0.save is entirely equivalent to pickling (in v0 or v1), so the special name "save" has been removed. (It was inspired by NumPy's "save" and "load," but eventually we just made Awkward's __setstate__ and __getstate__ do the same thing.)

    But pickling/old-style saving doesn't lazily load. (Edit: actually, I had forgotten that old-style save does lazily load, but only at top-most granularity: the arrays that are separate in the dict become separate "files" within the ZIP file. Parquet lazily loads subfields of nested records.)

    You're right that ak.to_parquet/ak.from_parquet is a good option for lazy loading, and this file format has a better compression-to-read-speed tradeoff than our pickling format. It's also a standard that many programs recognize. (If you use it, I recommend passing through use_dictionary=False and use_byte_stream_split=True for floating point data; all the options on this page can be supplied to ak.to_parquet as **options. I need to add some documentation explaining how these are good options for floating point.)

    It is also true that ak.to_parquet takes only one array argument. But that's fine: make a single array, not a dict. The fact that Awkward Array manipulates data structures helps you here. You can ak.zip all your arrays together into one array, using the same field names that you would have used as dict keys. If they have different internal structures, you can prevent it from trying to align them at all levels with depth_limit=1, and if they even have different lengths, you can nest each of them in a length-1 outer structure with

    has_one_more_dimension = original_array[np.newaxis]
    

    The names that ak.to_parquet uses for column names come from the records of the Awkward Array itself. Different fields in a record can have different data types. Therefore, the names you zip them with are the column names of the Parquet file, and each column can have a different type.

    Parquet files are lazily loaded by column (including fields of nested records) and by row group. If you want to configure the granularity of reading groups of rows, write the file as a partitioned array (ak.partitioned or ak.repartition).