Tags: python, pandas, parquet, pyarrow, ray

Low-latency response with Ray on a large(ish) dataset


TL;DR

What's the fastest way to get near-zero loading time for a pandas dataset I have in memory, using Ray?

Background

I'm building an application that uses semi-large datasets (pandas dataframes between 100MB and 700MB), and I'm trying to reduce the response time of each query. For many of my queries the data loading makes up the majority of the response time. The datasets are optimized parquet files (categories instead of strings, etc.), and each query only reads the columns it needs.

Currently I use a naive approach that, for each request, loads the required dataset (reading the 10-20 columns I need out of ~1000) and then filters out the rows I need.
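
A minimal sketch of that per-request flow (the file path, column names and row predicate are placeholders, not my real contract format):

```python
import pandas as pd

def handle_request(path, columns, predicate):
    # Load only the 10-20 columns this query needs; column pruning is cheap
    # with parquet, but this step still dominates the response time.
    df = pd.read_parquet(path, columns=columns)
    # Keep only the 30-60% of rows the query actually uses.
    return df[predicate(df)]

# Example call with a placeholder filter:
# handle_request("dataset.parquet", ["col_a", "col_b"],
#                lambda df: df["col_a"] == "some_value")
```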

A typical request:

  • Read and parse the contract (~50-100ms)
  • Load the dataset (10-20 columns) (400-1200ms)
  • Execute pandas operations (~50-100ms)
  • Serialise the results (50-100ms)

I'm now trying to speed this up (reduce or remove the load dataset step).

Things I have tried:

  1. Use Arrow's new row-level filtering on the dataset so that only the rows I need are read as well. This is probably a good approach in the future, but for now the new Arrow Dataset API it relies on is significantly slower than reading the full file using the legacy loader (see the sketch after this list).
  2. Optimize the hell out of the datasets. This works well up to a point: strings are stored as categories, the data types are optimized, and so on.
  3. Store the dataframe in Ray, using ray.put and ray.get (also sketched below). However, this doesn't actually improve the situation, since the time-consuming part is deserialization of the dataframe.
  4. Put the dataset in ramfs. This doesn't improve the situation either, for the same reason: the time-consuming part is deserialization of the dataframe.
  5. Store the object in a separate Plasma store (outside of ray.put), but obviously the speed is the same (even though I might get some other benefits).
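
For reference, roughly what attempts 1 and 3 look like (the file path, column names and filter expression are placeholders):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import ray

path = "dataset.parquet"          # placeholder
columns = ["col_a", "col_b"]      # the 10-20 columns a query needs

# Attempt 1: Arrow Dataset API with a row-level filter pushed into the scan.
# In my tests this was slower than the legacy reader on these files.
table = ds.dataset(path, format="parquet").to_table(
    columns=columns,
    filter=ds.field("col_a") == "some_value",
)

# Legacy reader: column pruning only; rows are filtered afterwards in pandas.
table_legacy = pq.read_table(path, columns=columns)

# Attempt 3: keep the dataframe in the Ray object store. ray.get still has
# to deserialize the object-dtype (string) columns, so latency barely improves.
ray.init(ignore_reinit_error=True)
df_ref = ray.put(table_legacy.to_pandas())
df = ray.get(df_ref)
```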

The datasets are parquet files, which are already pretty fast to serialize/deserialize. I typically select about 10-20 columns (out of 1000) and about 30-60% of the rows.

Any good ideas on how to speed up the loading? I haven't been able to find any near zero-copy operations for pandas dataframes (i.e. without the serialization penalty).

Things that I am thinking about:

  1. Placing the dataset in an actor, using one actor per thread (see the sketch after this list). That would probably give the actor direct access to the dataframe without any serialization, but it would require me to do a lot of handling of:

    • Making sure I have one actor per thread
    • Distributing requests across the actors
    • "Recycling" the actors when the dataset gets updated

Regards, Niklas


Solution

  • After talking to Simon on Slack we found the culprit:

    simon-mo: aha yes objects/strings are not zero copy. categorical or fixed length string works. for fixed length you can try convert them to np.array first

    Experimenting with this (categorical values, fixed-length strings, etc.) gets me not quite zero-copy, but at least fairly low latency (~300ms or less) when using Ray objects or the Plasma store.
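
A minimal sketch of that fix (column names are placeholders): converting object-dtype string columns to categoricals before ray.put means ray.get can rebuild the dataframe from shared-memory buffers with very little copying.

```python
import pandas as pd
import ray

ray.init(ignore_reinit_error=True)

df = pd.read_parquet("dataset.parquet", columns=["col_a", "col_b"])

# Object-dtype (string) columns force a full copy on ray.get; categoricals
# and fixed-width dtypes can be reconstructed from shared memory instead.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category")

ref = ray.put(df)
df_shared = ray.get(ref)  # now ~300ms or less instead of a full reload
```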