Tags: python, pandas, dataframe, sampling, bigdata

Quickly sampling a large number of rows from a large dataframe in Python


I have a very large dataframe (about 1.1M rows) and I am trying to sample it.

I have a list of indexes (about 70,000 indexes) that I want to select from the entire dataframe.

This is what I've tried so far, but all of these methods are taking far too much time:

Method 1 - Using pandas:

import pandas

sample = pandas.read_csv("data.csv", index_col=0).reset_index()
sample = sample[sample['Id'].isin(sample_index_array)]

Method 2:

I tried to write all the sampled lines to another csv.

f = open("data.csv", 'r')
out = open("sampled_data.csv", 'w')
out.write(f.readline())  # copy the header line over

total = 0
while True:
    total += 1
    line = f.readline().strip()

    if line == '':
        break
    arr = line.split(",")

    # keep the line if its index (first column) is in the sample
    if int(arr[0]) in sample_index_array:
        out.write(line + "\n")

Can anyone please suggest a better method, or a way to modify this to make it faster?

Thanks


Solution

  • We don't have your data, so here is an example with two options:

    1. after reading: use a pandas Index object to select a subset via the .iloc selection method
    2. while reading: a predicate with the skiprows parameter

    Given

    A collection of indices and a (large) sample DataFrame written to test.csv:

    import pandas as pd
    import numpy as np

    indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]

    # 1,000,000 rows of random integers in [0, 100), written without a header row
    df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list("ABCD"))
    df.to_csv("test.csv", header=False)
    df.info()
    

    Output

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000000 entries, 0 to 999999
    Data columns (total 4 columns):
    A    1000000 non-null int32
    B    1000000 non-null int32
    C    1000000 non-null int32
    D    1000000 non-null int32
    dtypes: int32(4)
    memory usage: 15.3 MB
    

    Code

    Option 1 - after reading

    Convert a sample list of indices to an Index object and slice the loaded DataFrame:

    idxs = pd.Index(indices)   
    subset = df.iloc[idxs, :]
    print(subset)
    

    The .iat and .at methods are even faster, but require scalar indices.
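
    For instance, a minimal sketch reusing df and indices from above (the
    column choices here are just for illustration):

    # .iat is positional (like .iloc); .at is label-based (like .loc).
    # Both take a single row/column pair. With a RangeIndex, labels equal positions.
    col_a = [df.iat[i, 0] for i in indices]    # column "A" values by position
    col_a = [df.at[i, "A"] for i in indices]   # the same values, by label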


    Option 2 - while reading (Recommended)

    We can pass skiprows a predicate that returns True for every row number not in indices, so unwanted rows are dropped while the file is being read (more efficient):

    # skiprows calls the predicate with each row number; a row is skipped
    # when the predicate returns True, i.e. whenever it is not in indices
    pred = lambda x: x not in indices
    data = pd.read_csv("test.csv", skiprows=pred, index_col=0, names=list("ABCD"))
    print(data)
    
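    One caveat: skiprows calls the predicate once for every row in the file,
    so with a sample of ~70,000 indices a set lookup is much cheaper than
    scanning a Python list. A small adjustment, assuming the same test.csv:

    index_set = set(indices)  # O(1) membership test instead of an O(n) list scan
    data = pd.read_csv("test.csv", skiprows=lambda x: x not in index_set,
                       index_col=0, names=list("ABCD"))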

    See also the issue that led to extending skiprows.


    Results

    Both options produce the same output:

            A   B   C   D
    1      74  95  28   4
    2      87   3  49  94
    3      53  54  34  97
    10     58  41  48  15
    20     86  20  92  11
    30     36  59  22   5
    67     49  23  86  63
    78     98  63  60  75
    900    26  11  71  85
    2176   12  73  58  91
    78776  42  30  97  96
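
    To compare the two options on your own data, here is a rough timing sketch
    (assuming the test.csv generated above; absolute numbers will vary with
    your machine and file):

    import time

    import pandas as pd

    indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]
    index_set = set(indices)

    # Option 1: load the full file, then slice by position
    t0 = time.perf_counter()
    df = pd.read_csv("test.csv", index_col=0, names=list("ABCD"))
    subset = df.iloc[pd.Index(indices), :]
    t1 = time.perf_counter()

    # Option 2: drop unwanted rows while reading
    data = pd.read_csv("test.csv", skiprows=lambda x: x not in index_set,
                       index_col=0, names=list("ABCD"))
    t2 = time.perf_counter()

    print(f"Option 1 (read then slice): {t1 - t0:.2f}s")
    print(f"Option 2 (skiprows):        {t2 - t1:.2f}s")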