Tags: python, dask, sampling

Fast way to sample a Dask data frame (Python)


I have a huge file that I read with Dask (Python). The file is around 6 million rows and 550 columns. I would like to select a random sample of 5,000 records (without replacement). Here are the two methods I tried, but both take a huge amount of time to run (I stopped after more than 13 hours):


df_s = df.sample(frac=5000/len(df), replace=False, random_state=10)


import numpy as np

NSAMPLES = 5000
samples = np.random.choice(df.index, size=NSAMPLES, replace=False)
df_s = df.loc[samples]

I am not sure that these are appropriate methods for Dask dataframes. Is there a faster way to randomly select records from a huge dataframe?


Solution

  • I think the problem might be coming from the len(df) in your first example.

    When len() is called on the Dask dataframe, it triggers a computation of the total number of rows, and I think that is what's slowing you down.
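    As a rough illustration (the file path here is just a placeholder), the dataframe stays lazy until something forces a computation, and len() is one of the things that forces Dask to scan every partition:

    import dask.dataframe as dd

    # lazy: only a small sample of the file is read to infer dtypes
    df = dd.read_csv("data.csv")

    # this forces Dask to read every partition just to count the rows
    num_rows = len(df)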

    If you know the length of the dataframe is 6M rows, then I'd suggest changing your first example to be something similar to:

    num_rows = 6_000_000  # known row count, so no full scan is needed
    df_s = df.sample(frac=5000/num_rows, replace=False, random_state=10)
    

    or

    df_s = df.sample(frac=5000/6_000_000, replace=False, random_state=10)
    
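    One caveat: Dask's sample only accepts frac (not n) and samples each partition independently, so the result will have roughly 5,000 rows rather than exactly 5,000. If the exact count matters, a rough sketch is to oversample slightly and trim once the (now small) result is in pandas:

    frac = 5000 / 6_000_000
    # oversample a little so we almost surely get at least 5,000 rows back
    df_s = df.sample(frac=frac * 1.1, random_state=10).compute()
    # trim down to exactly 5,000 rows in plain pandas
    df_s = df_s.sample(n=5000, random_state=10)
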

    If you're absolutely sure you want to use len(df), you might want to consider how you're loading up the dask dataframe in the first place.

    For example, if you're reading a single CSV file on disk, it'll take a fairly long time, since the data you'll be working with (assuming all numerical, 64-bit float/int columns for the sake of this) is 6 million rows * 550 columns * 8 bytes = 26.4 GB.

    This is because Dask is forced to read all of the data when it's in CSV format. The problem gets even worse with str or other data types, and you also have to factor in the disk read time.

    In comparison, working with Parquet is much easier, since Parquet files store metadata, which generally speeds up the process, and I believe much less data needs to be read.
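
    As a rough sketch of that idea (paths are placeholders), you can pay the CSV read cost once, write the data out as Parquet, and then sample from the Parquet copy in later runs:

    import dask.dataframe as dd

    # one-off conversion: read the CSV once and store it as Parquet
    df = dd.read_csv("data.csv")
    df.to_parquet("data_parquet/")  # needs a parquet engine such as pyarrow

    # later runs scan the Parquet copy instead, which is much cheaper
    df = dd.read_parquet("data_parquet/")
    df_s = df.sample(frac=5000/6_000_000, random_state=10)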