Search code examples
pythonnumpycsvparquethdf5

Python large-scale data format for distributed writing, reading, and storing on AWS


I'm trying to figure out the best way to write, read, and store data on AWS using a conventional Python data format. From my various Googling, I wasn't able to find a conclusive list of Big O notation run-times for adding operations. The best thing I could find was a textual overview of the popular file formats, where the author comes to the conclusion that no one format covers all use cases.

My use case is as follows: , but here's the specific use case I'm looking at:

  • I have a dataset which can be indexed by "sample_id". There are 2M to 8M such samples.
  • I would have some function to generate data for each sample. The generated data would be one fixed-length embedding per sample (basically a NumPy or PyTorch array 200 float16 values)
  • The data generation would be happening across multiple GPUs or in a multithreaded context
  • I want to quickly append these arrays in a (key, value) fashion to a into a single (or sharded) big file
    [ideally O(1) insertion time]
  • Ideally, if the key already exists, I'd be overwriting the value
  • Once the data generation is complete, I'd plan on transferring the big file (or sharded files) to an S3 bucket
  • Later on, in another machine, I'll download the data from S3
  • Subsequently, I'd like my modeling code to read out specific embeddings from the big data file (or shards) using a key
    [so ideally O(1) read time]
  • I'm willing to forgo space efficiency if that means improved read/write time

The datasets I'm processing will probably have somewhere between 2 million to 8 million rows (meaning there will be 2M to 8M keys). And the file formats I'm considering as of now are:

  • .npy
  • .hd5
  • .parquet
  • .csv

In my current work flow, I was saving individual NPY files, which is definitely a workaround to the distributed-ness of my problem. But transferring all the data to and from AWS becomes cumbersome when I have to do recursive uploads/downloads.

Would appreciate any advice on the best approach here! Similar to the above, I'd also be curious what would be the best format if the data I wanted to save was a JSON object.


Solution

  • Here's what I would suggest:

    When creating your embeddings, use your current solution, and create lots of little .npy files.

    Once you've created your embeddings, and you know how many there are, pack them into two files.

    The first file is a .npy file with an array containing every embedding, of shape (N, 200). With memory mapping, you can create arrays much bigger than memory.

    The second file is a shelf. This file maps the "sample_id" key into an integer. The integer represents an index into the array.

    When you want to find an embedding, do this:

    array[shelf[sample_id]]
    

    i.e. look up the sample_id in the shelf, then look up the index of the embedding in the array.

    If you memory-map the npy file into memory, and use the shelf, you can load single embeddings into memory without having the entire dataset in memory. However, it would require downloading the entire file from S3 before using it.