Tags: python, pandas, parallel-processing, joblib

A faster way to fill NaN values with the average


I had my previous question closed, so I am reposting it with more context. I was running this on a relatively large dataset (59 GB, with a shape of (800,000, 10,500)) on my AWS EC2 instance, and df.fillna(df.mean()) was taking extraordinarily long; after 4 hours I cancelled the cell. Is there a faster way of calculating the average and filling each NaN value of each column?

Here is a sample set of the data

d = {'B19325_038E': {409606: 9.0, 403811: 53.0, 400166: 17.0, 402573: 105.0, 400130: 43.0, 404907: 21.0, 406751: 15.0, 403850: 39.0, 404089: 81.0, 409843: np.nan}, 'B08302_014E': {409606: 2.0, 403811: 156.0, 400166: 64.0, 402573: 211.0, 400130: 140.0, 404907: 90.0, 406751: 148.0, 403850: 71.0, 404089: 341.0, 409843: 91.0}, 'B17010I_026E': {409606: np.nan, 403811: 9.0, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: 21.0}, 'B17015_009E': {409606: 30.0, 403811: 18.0, 400166: 12.0, 402573: 5.0, 400130: 6.0, 404907: 11.0, 406751: 23.0, 403850: 49.0, 404089: 37.0, 409843: 60.0}, 'B06003_004E': {409606: 1552.0, 403811: 3562.0, 400166: 2536.0, 402573: 4911.0, 400130: 1913.0, 404907: 1888.0, 406751: 4264.0, 403850: 2087.0, 404089: 1443.0, 409843: 867.0}, 'B15001_038E': {409606: 46.0, 403811: 104.0, 400166: 89.0, 402573: 120.0, 400130: 61.0, 404907: 14.0, 406751: 60.0, 403850: 198.0, 404089: 97.0, 409843: 25.0}, 'B08130_006E': {409606: 280.0, 403811: 2325.0, 400166: 1381.0, 402573: 2907.0, 400130: 1300.0, 404907: 1528.0, 406751: 2502.0, 403850: 1278.0, 404089: 1986.0, 409843: 308.0}, 'B19201_002E': {409606: 80.0, 403811: 75.0, 400166: 24.0, 402573: 54.0, 400130: np.nan, 404907: np.nan, 406751: 43.0, 403850: 62.0, 404089: 32.0, 409843: 33.0}, 'B19325_087E': {409606: 35.0, 403811: 29.0, 400166: 33.0, 402573: 72.0, 400130: 20.0, 404907: np.nan, 406751: 39.0, 403850: 40.0, 404089: 40.0, 409843: 5.0}, 'B06003_008E': {409606: 106.0, 403811: 458.0, 400166: 296.0, 402573: 505.0, 400130: 277.0, 404907: 804.0, 406751: 1037.0, 403850: 726.0, 404089: 1854.0, 409843: 80.0}, 'B16006_003E': {409606: 30.0, 403811: 525.0, 400166: 160.0, 402573: 33.0, 400130: 386.0, 404907: 2.0, 406751: 55.0, 403850: 121.0, 404089: 686.0, 409843: 228.0}, 'C14007A_004E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C14007A_005E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C14007A_003E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C21001I_003E': {409606: 31.0, 403811: 287.0, 400166: 86.0, 402573: 25.0, 400130: 235.0, 404907: 35.0, 406751: 32.0, 403850: 73.0, 404089: 384.0, 409843: 84.0}, 'C21001I_006E': {409606: np.nan, 403811: 35.0, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: 13.0, 403850: 17.0, 404089: 19.0, 409843: 6.0}}

df = pd.DataFrame(data=d)

Here is a picture of htop running on my machine, showing where it is at while df.fillna(df.mean()) is running:

[htop screenshot]

As you can see, it appears to be working, but I do not see the memory fluctuating at all, so it could be frozen. It is hard to tell, and it is a waste of money to just let it keep going for more than 4 hours.

Is there a way of parallelizing df.fillna(df.mean()) to make it run faster?

To provide even more context, here is what I am currently trying, since up until now no one seems to have an answer.

from joblib import Parallel, delayed

def fill_nan(df, col):
    df[col].fillna(df[col].mean(), inplace=True)
    return df

col_list = all_data.columns.tolist()
l = Parallel(n_jobs=-1)(delayed(fill_nan)(df=all_data, col=col) for col in col_list)

The problem with this is that I am getting the following error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGSEGV(-11)}

Despite the error, would this method actually make the computation faster?
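
One note on the error itself: joblib's default loky backend pickles the arguments of every delayed call, so delayed(fill_nan)(df=all_data, col=col) copies the entire frame to each worker, which is the likely reason the OS kills them on a 59 GB dataset. A minimal sketch of a memory-lighter variant, assuming the goal is just per-column means, would ship one column per job and return only its mean (col_mean is an illustrative helper, not an existing API):

import numpy as np
from joblib import Parallel, delayed

def col_mean(values):
    # each worker receives a single column's ndarray, not the whole frame
    return np.nanmean(values)

means = Parallel(n_jobs=-1)(
    delayed(col_mean)(all_data[col].to_numpy()) for col in all_data.columns
)
# skip all-NaN columns, whose mean is NaN (np.nanmean warns on all-NaN input)
fill_values = {col: m for col, m in zip(all_data.columns, means) if not np.isnan(m)}
all_data.fillna(value=fill_values, inplace=True)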


Solution

  • In my experience, calling .fillna() across all columns is more expensive than applying it selectively to just the columns that contain NaN. To see this, compare the following two functions:

    def fill_nan1(df):
        # fills every column, whether or not it contains NaN
        col_list = df.columns.tolist()
        for col in col_list:
            df[col].fillna(df[col].mean(), inplace=True)
        return df

    def fill_nan2(df):
        # fills only the columns that actually contain NaN
        for col in df.columns[df.isnull().any(axis=0)]:
            df[col].fillna(df[col].mean(), inplace=True)
        return df
    

    .fillna() is applied to every column of df in fill_nan1() (as in your case), while it is applied only to the columns containing NaN in fill_nan2(). Running %timeit on both gives the following:

    %timeit fill_nan1(df)
    2.35 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %timeit fill_nan2(df)
    938 µs ± 8.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
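
    At the full size of the original frame (800,000 × 10,500), a further option is to push the per-column fill down to NumPy, which avoids building and aligning the Series of means that df.fillna(df.mean()) constructs. A minimal sketch, assuming all columns are numeric floats (fill_nan_numpy is an illustrative name, not an existing API):

    import numpy as np

    def fill_nan_numpy(df):
        # touch only the columns that actually contain NaN, and work on the
        # underlying float arrays instead of aligning a Series of means
        for col in df.columns[df.isnull().any(axis=0)]:
            vals = df[col].to_numpy()
            mask = np.isnan(vals)
            if mask.all():
                continue  # all-NaN column: no mean to fill with
            df[col] = np.where(mask, vals[~mask].mean(), vals)
        return df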
    

    Also, if this is for an ML task, split your data into train and test sets before filling your NaN values: that not only makes the computation faster but also protects you from incorrect imputation, since the test set should be filled with statistics computed on the training data alone.
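
    A minimal sketch of that workflow, assuming scikit-learn's train_test_split (and that df holds only numeric columns):

    from sklearn.model_selection import train_test_split

    train, test = train_test_split(df, test_size=0.2, random_state=0)
    train_means = train.mean()        # statistics from the training set only
    train = train.fillna(train_means)
    test = test.fillna(train_means)   # nothing leaks from the test set into the imputation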