I'm reading data from a large text file (a VCF) into a zarr array. The overall flow of the code is:
with zarr.LMDBStore(...) as store:
    array = zarr.create(..., chunks=(1000, 1000), store=store, ...)
    for line_num, line in enumerate(text_file):
        array[line_num, :] = process_data(line)
When does zarr compress the modified chunks of the array and push them to the underlying store (in this case LMDB)? Does it do that every time a chunk is updated (i.e. on each line), or does it wait until a chunk is filled or evicted from memory? Assuming that I need to process each line separately in a for loop (there are no efficient array operations to use here, due to the nature of the data and processing), is there any optimization I should make in how I feed the data into zarr?
I just don't want zarr running compression on each modified chunk for every line, when each chunk will be modified 1000 times before it is complete and ready to save to disk.
Thanks!
Every time you execute this line:
array[line_num, :] = process_data(line)
...zarr will (1) figure out which chunks overlap the array region you want to write to, (2) retrieve those chunks from the store, (3) decompress the chunks, (4) modify the data, (5) compress the modified chunks, (6) write the modified compressed chunks to the store.
This will happen regardless of what type of underlying storage you are using.
If you have created an array with chunks that are more than one row tall, then this will likely be inefficient, resulting in each chunk being read, decompressed, updated, compressed and written many times.
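To put a rough number on that cost, here is a small sketch that counts how many times a chunk gets compressed and stored for a given write pattern. The array shape (10,000 x 2,000) is a made-up example, not taken from the question, and the helper name chunk_write_cycles is hypothetical. It assumes each write covers full rows and that writes proceed in order from row 0.

```python
import math


def chunk_write_cycles(n_rows, n_cols, chunk_rows, chunk_cols, rows_per_write):
    """Count how many times a chunk is compressed and stored.

    Assumes writes start at row 0, cover full rows, and proceed in order.
    """
    # Every full-row write touches all chunks along the column axis.
    chunks_across = math.ceil(n_cols / chunk_cols)
    cycles = 0
    for start in range(0, n_rows, rows_per_write):
        stop = min(start + rows_per_write, n_rows)
        # Range of chunk-rows overlapped by this write.
        first = start // chunk_rows
        last = (stop - 1) // chunk_rows
        cycles += (last - first + 1) * chunks_across
    return cycles


# Hypothetical 10,000 x 2,000 array with (1000, 1000) chunks.
row_by_row = chunk_write_cycles(10_000, 2_000, 1000, 1000, rows_per_write=1)
block_wise = chunk_write_cycles(10_000, 2_000, 1000, 1000, rows_per_write=1000)
print(row_by_row, block_wise)  # 20000 20
```

With these assumed numbers, writing one row at a time compresses and stores chunks 20,000 times, while writing 1000 rows at a time does so 20 times, once per chunk.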
A better strategy would be to parse your input file in blocks of N lines, where N is equal to the number of rows in each chunk of the output array, so that each chunk is only compressed and written once.
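A minimal sketch of that buffering pattern follows. The function write_in_blocks and the CountingTarget stand-in are hypothetical names used only for illustration; the stand-in just records each slice assignment so the example runs without zarr installed. With a real zarr array you would pass the array as target, feed in something like (process_data(line) for line in text_file), and probably convert each buffered block with numpy.asarray before assigning it.

```python
def write_in_blocks(rows, target, block_rows):
    """Buffer parsed rows and write them to `target` one block at a time.

    `target` is anything that supports target[start:stop, :] = block.
    Returns the number of rows written.
    """
    buffer = []
    start = 0
    for row in rows:
        buffer.append(row)
        if len(buffer) == block_rows:
            # One assignment per full block of rows.
            target[start:start + block_rows, :] = buffer
            start += block_rows
            buffer = []
    if buffer:
        # Flush the final, possibly shorter, block.
        target[start:start + len(buffer), :] = buffer
        start += len(buffer)
    return start


class CountingTarget:
    """Hypothetical stand-in for the output array; records each write."""

    def __init__(self):
        self.writes = []

    def __setitem__(self, key, value):
        row_slice = key[0]
        self.writes.append((row_slice.start, row_slice.stop, len(value)))


demo = CountingTarget()
fake_rows = ([i, i + 1] for i in range(2500))  # stands in for parsed lines
n_written = write_in_blocks(fake_rows, demo, block_rows=1000)
print(n_written, demo.writes)
# 2500 [(0, 1000, 1000), (1000, 2000, 1000), (2000, 2500, 500)]
```

Setting block_rows to the chunk height (1000 in the question) keeps every write aligned to chunk boundaries, so only the last, partial block touches a chunk that is not completely filled.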
If by VCF you mean Variant Call Format files, you might want to look at the vcf_to_zarr function implementation in scikit-allel.