So back in awkward v0 it was possible to do:
import awkward
dog = awkward.fromiter([[1., 2.], [5.]])
cat = awkward.fromiter([[4], [3]])
dict_of_arrays = {'dog': dog, 'cat': cat}
awkward.save("pets.awkd", dict_of_arrays)
Then we could lazily load the arrays:
reloaded_data = awkward.load("pets.awkd")
# no data in ram
double_dog = reloaded_data["dog"]*2
# dog is in ram but not cat
In short, we have a dataset consisting of 'dog' and 'cat' parts. The whole dataset saves to one file on disk. Even if I didn't have any documentation, it would be obvious which data is dog and which is cat. Dog and cat load as awkward arrays. I can load the data and work with just one part without the other part ending up in RAM.
I'm looking for the best way to do this in awkward v1. The requirements I would like to meet are:
- the whole dataset saves to one file;
- it stays obvious which part is dog and which is cat, even without documentation;
- each part keeps its own type (dog is a float array and cat is an int array);
- the parts load lazily, so I can work with one without the other ending up in RAM.

I had a look at awkward1.to_parquet and while it looks good it seems to be just for saving one array. This doesn't fit well with the need to hold multiple data types, and I'm not sure how I'd record the column names.
I suppose I could convert back to awkward v0 and save that way but I'm not sure how that would play with lazy loading. It might be that I need to write a wrapper to do these things, which would be totally fine, but I wanted to check first if there is something built in that I should know about.
Edit: the answer given works great. For completeness I wanted to leave an example of using it:
In [1]: import awkward1 as ak
In [2]: dog = ak.from_iter([[1., 2.], [5.]])
...: cat = ak.from_iter([[4], [3]])
In [3]: ak.zip?
In [4]: pets = ak.zip({"dog": dog, "cat": cat}, depth_limit=1)
In [5]: pets.dog
Out[5]: <Array [[1, 2], [5]] type='2 * var * float64'>
In [6]: pets.cat
Out[6]: <Array [[4], [3]] type='2 * var * int64'>
In [7]: ak.to_parquet(pets, "pets.parquet")
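And a sketch (my addition) of reading it back lazily, using the lazy option of ak.from_parquet that the answer below discusses:

In [8]: reloaded = ak.from_parquet("pets.parquet", lazy=True)

In [9]: double_dog = reloaded.dog * 2  # reads only the 'dog' column; 'cat' stays on disk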
What Awkward v0 did with awkward0.save is entirely equivalent to pickling (in v0 or v1), so the special name "save" has been removed. (It was inspired by NumPy's "save" and "load," but eventually we just made Awkward's __setstate__ and __getstate__ do the same thing.)
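For example (my sketch, not part of the original answer), a v1 array round-trips through the ordinary pickle module:

import pickle
import awkward1 as ak

pets = ak.from_iter([[1.0, 2.0], [5.0]])
blob = pickle.dumps(pets)      # __getstate__ serializes the array
restored = pickle.loads(blob)  # __setstate__ rebuilds it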
But pickling/old-style saving doesn't lazily load. (Edit: actually, I had forgotten that old-style save does lazily load, but only at top-most granularity: the arrays that are separate in the dict become separate "files" within the ZIP file. Parquet lazily loads subfields of nested records.)
You're right that ak.to_parquet/ak.from_parquet is a good option for lazy loading, and this file format has a better compression-to-read-speed tradeoff than our pickling format. It's also a standard that many programs recognize. (If you use it, I recommend passing use_dictionary=False and use_byte_stream_split=True for floating-point data; all the options on this page can be supplied to ak.to_parquet as **options. I need to add some documentation explaining how these are good options for floating point.)
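Concretely, that would look something like this (a sketch; use_dictionary and use_byte_stream_split are pyarrow Parquet-writer options passed through as **options):

ak.to_parquet(
    pets,
    "pets.parquet",
    use_dictionary=False,        # dictionary encoding rarely helps floats
    use_byte_stream_split=True,  # byte-stream-split compresses floats better
)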
It is also true that ak.to_parquet takes only one array argument. But that's fine: make a single array, not a dict. The fact that Awkward Array manipulates data structures helps you here. You can ak.zip all your arrays together into one array, using the same field names that you would have used as dict keys. If they have different internal structures, you can prevent it from trying to align them at all levels with depth_limit=1, and if they even have different lengths, you can nest them each in a length-1 outer structure with
has_one_more_dimension = original_array[np.newaxis]
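Putting those two together (a sketch on my part, with a hypothetical 'fish' array of a different length):

import numpy as np
import awkward1 as ak

dog = ak.from_iter([[1.0, 2.0], [5.0]])  # length 2
fish = ak.from_iter([[6], [7], [8]])     # length 3: can't be zipped directly
combined = ak.zip(
    {"dog": dog[np.newaxis], "fish": fish[np.newaxis]},  # both are length 1 now
    depth_limit=1,                                       # don't align inner lists
)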
The names that ak.to_parquet uses for column names come from the records of the Awkward Array itself. Different fields in a record can have different data types. Therefore, the names you zip them with become the column names of the Parquet file, and each column can have a different type.
Parquet files are lazily loaded by column (including fields of nested records) and by row group. If you want to configure the granularity of reading groups of rows, write the file as a partitioned array (ak.partitioned or ak.repartition).
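For instance (a sketch, assuming awkward1's ak.repartition signature of an array plus a partition length):

pets_partitioned = ak.repartition(pets, 1)  # one entry per partition, for fine-grained row groups
ak.to_parquet(pets_partitioned, "pets_partitioned.parquet")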