What's the fastest way to get near-zero loading time for a pandas dataset I have in memory, using Ray?
I'm building an application that uses semi-large datasets (pandas dataframes between 100 MB and 700 MB) and am trying to reduce each query's response time. For a lot of my queries, loading the data makes up the majority of the response time. The datasets are optimized parquet files (categories instead of strings, etc.) and I only read the columns I need.
Currently I use a naive approach that, per request, loads the required dataset (reading the 10-20 columns I need out of 1000) and then filters down to the rows I need.
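Roughly, the current per-query path looks like this (the file path and column names are just placeholders):

```python
import pandas as pd

# Placeholder path and columns for illustration.
DATASET_PATH = "data/dataset.parquet"
NEEDED_COLUMNS = ["user_id", "country", "value"]

def load_for_query(country: str) -> pd.DataFrame:
    # Read only the columns the query needs (parquet is columnar,
    # so the other ~980 columns are never touched on disk)...
    df = pd.read_parquet(DATASET_PATH, columns=NEEDED_COLUMNS)
    # ...then filter down to the rows the query needs.
    return df[df["country"] == country]
```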
I'm now trying to speed this up (reduce or remove the load dataset step).
The datasets are parquet files, which is already pretty fast for serialization/deserialization. I typically select about 10-20 columns (out of 1000) and about 30-60% of the rows.
Any good ideas on how to speed up the loading? I haven't been able to find any near-zero-copy way to share pandas dataframes (i.e. without the serialization penalty).
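To be concrete, what I'd like is something like the sketch below, but without `ray.get` paying a full deserialization cost each time (the path is a placeholder):

```python
import ray
import pandas as pd

ray.init()

# Stand-in for one of my datasets.
df = pd.read_parquet("data/dataset.parquet")

# Put the dataframe into the Ray object store once...
df_ref = ray.put(df)

# ...and fetch it per query. Columns backed by plain numpy buffers
# can be reconstructed cheaply from shared memory; object/string
# columns have to be deserialized, which is where the time goes.
df_again = ray.get(df_ref)
```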
One idea is placing the dataset in an actor and using one actor per thread. That would probably give the actor direct access to the dataframe without any serialization, but it would require a lot of extra handling of the actors themselves (rough sketch below).
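A minimal sketch of that idea (path, column, and filter names are placeholders); note the filtered result still gets serialized on the way back to the caller:

```python
import ray
import pandas as pd

ray.init()

@ray.remote
class DatasetActor:
    """Holds the dataframe in the actor's own process memory,
    so queries run against it directly with no serialization."""

    def __init__(self, path: str):
        self.df = pd.read_parquet(path)

    def query(self, country: str, columns: list) -> pd.DataFrame:
        # Only the (much smaller) filtered result crosses the
        # process boundary and pays the serialization cost.
        return self.df.loc[self.df["country"] == country, columns]

actor = DatasetActor.remote("data/dataset.parquet")
result = ray.get(actor.query.remote("SE", ["user_id", "value"]))
```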
Regards, Niklas
After talking to Simon on Slack we found the culprit:
simon-mo: aha yes objects/strings are not zero copy. categorical or fixed length string works. for fixed length you can try convert them to np.array first
Experimenting with this (categorical values, fixed-length strings, etc.) gets me not quite zero-copy, but at least fairly low latency (~300 ms or less) when using Ray objects or the Plasma store.
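For reference, roughly what I ended up doing before putting the dataframe into the object store (the dtype handling here is illustrative; which columns need converting depends on the dataset):

```python
import ray
import pandas as pd

ray.init()

df = pd.read_parquet("data/dataset.parquet")

# Convert object/string columns to categorical so the dataframe is
# backed by numeric codes instead of Python string objects.
# (Fixed-length strings can alternatively be converted to a numpy
# array first, as Simon suggested.)
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category")

df_ref = ray.put(df)
df_fast = ray.get(df_ref)  # now close to zero-copy / low latency
```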