Performance Tips for using Vaex


I am using Vaex and looking for performance tips.

My use-case is as follows:

  • I have a large dataframe, let's call it large_df (only a few columns but tens of millions of rows; in production, the dataset will be more than 10x as large). One of the columns is called key and holds a 64-character alphanumeric string. The contents of this dataframe are stored across several HDF5 files. I create the dataframe by calling vaex.open_many(<path/to/hdf5 files>).

  • On each request, the code receives a small number of keys (in the tens) to look up in large_df. I then have to find the rows of large_df whose keys match the input list and do some processing on the resulting dataframe (which will be much smaller).

From what I have read, Vaex should be perfect for my use case; however, I have struggled to get the performance I was expecting.

My code is essentially this:

import vaex
# hdf5_paths: list of paths to the HDF5 files
# input_keys: list of keys received with the request
df = vaex.open_many(hdf5_paths)
df = df[df.key.isin(input_keys)].to_pandas_df()

When all the HDF5 files are cached on disk ahead of time, this code takes around 80 seconds on an i3.8xlarge instance. The code runs inside a Docker container with the CPUs capped at 30 (of 32 available). I read the article about how well Vaex handles strings, and at first glance this seems like the type of task Vaex should be able to parallelize easily and finish in far less than ~80s.

I have also tried precomputing a short_id column and storing it in the dataset that makes up large_df. This is an integer representing the first 4 characters of the key column. I then tried pre-filtering the df on short_id before doing the full string comparison. This code looks like the following:

import vaex
# hdf5_paths: list of paths to the HDF5 files
# input_keys: list of keys received with the request
df = vaex.open_many(hdf5_paths)
short_ids = [alphanumeric_string_to_int(key) for key in input_keys]
df = df[df.short_id.isin(short_ids)]  # cheap integer pre-filter to shrink the df
df = df[df.key.isin(input_keys)].to_pandas_df()  # exact string match on what is left

That shaved off about 10 seconds, but I expected the pre-filter to make things a lot faster. I feel like I am missing something obvious for how to make this blazing fast.
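The question does not show alphanumeric_string_to_int. As a rough illustration only, here is one way such a helper could be written, assuming the keys are ASCII alphanumeric and that the first 4 characters are read as a base-36 number. The body of the function, the prefix_len parameter, and the sample keys are all assumptions made for this sketch, not the asker's actual code.

```python
def alphanumeric_string_to_int(key: str, prefix_len: int = 4) -> int:
    """Map the first `prefix_len` characters of an alphanumeric key to an int.

    Hypothetical helper: the prefix is parsed as a base-36 number
    (digits 0-9, then letters a-z, case-insensitive).
    """
    prefix = key[:prefix_len]
    if len(prefix) < prefix_len or not (prefix.isascii() and prefix.isalnum()):
        raise ValueError(f"expected at least {prefix_len} ASCII alphanumeric characters: {key!r}")
    # int(..., 36) ignores case, so "ABCD" and "abcd" share one id. That is
    # acceptable for a coarse pre-filter, because the exact string comparison
    # on the full key still runs afterwards.
    return int(prefix, 36)


# Made-up 64-character keys, for illustration only.
input_keys = ["a1b2" + "0" * 60, "Zz9k" + "f" * 60]

# Deduplicate so the integer pre-filter receives each short id once.
short_ids = sorted({alphanumeric_string_to_int(key) for key in input_keys})
```

With 4 base-36 characters the largest possible id is 36**4 - 1, so the column fits comfortably in a 32-bit integer. The same function would have to be used when the short_id column is written to the HDF5 files, otherwise the pre-filter could drop matching rows.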

What can I do? Please help - thank you!


Solution

  • Yes, this is embarrassingly slow. Vaex's .isin(..) was not being smart here, so I solved your problem in https://github.com/vaexio/vaex/pull/822. I've seen a 275x speedup for strings. I will make a release to address this once the PR is merged.

    Regards,

    Maarten Breddels - vaex.io