My previous question was closed, so I am reposting it with more context. I am working with a relatively large dataset (59 GB, shape (800000, 10500)) on an AWS EC2 instance. Running df.fillna(df.mean()) was taking extraordinarily long; after 4 hours I cancelled the cell. Is there a faster way to compute each column's mean and fill that column's NaN values?
Here is a sample of the data (imports included so the snippet runs):
import numpy as np
import pandas as pd
d = {'B19325_038E': {409606: 9.0, 403811: 53.0, 400166: 17.0, 402573: 105.0, 400130: 43.0, 404907: 21.0, 406751: 15.0, 403850: 39.0, 404089: 81.0, 409843: np.nan}, 'B08302_014E': {409606: 2.0, 403811: 156.0, 400166: 64.0, 402573: 211.0, 400130: 140.0, 404907: 90.0, 406751: 148.0, 403850: 71.0, 404089: 341.0, 409843: 91.0}, 'B17010I_026E': {409606: np.nan, 403811: 9.0, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: 21.0}, 'B17015_009E': {409606: 30.0, 403811: 18.0, 400166: 12.0, 402573: 5.0, 400130: 6.0, 404907: 11.0, 406751: 23.0, 403850: 49.0, 404089: 37.0, 409843: 60.0}, 'B06003_004E': {409606: 1552.0, 403811: 3562.0, 400166: 2536.0, 402573: 4911.0, 400130: 1913.0, 404907: 1888.0, 406751: 4264.0, 403850: 2087.0, 404089: 1443.0, 409843: 867.0}, 'B15001_038E': {409606: 46.0, 403811: 104.0, 400166: 89.0, 402573: 120.0, 400130: 61.0, 404907: 14.0, 406751: 60.0, 403850: 198.0, 404089: 97.0, 409843: 25.0}, 'B08130_006E': {409606: 280.0, 403811: 2325.0, 400166: 1381.0, 402573: 2907.0, 400130: 1300.0, 404907: 1528.0, 406751: 2502.0, 403850: 1278.0, 404089: 1986.0, 409843: 308.0}, 'B19201_002E': {409606: 80.0, 403811: 75.0, 400166: 24.0, 402573: 54.0, 400130: np.nan, 404907: np.nan, 406751: 43.0, 403850: 62.0, 404089: 32.0, 409843: 33.0}, 'B19325_087E': {409606: 35.0, 403811: 29.0, 400166: 33.0, 402573: 72.0, 400130: 20.0, 404907: np.nan, 406751: 39.0, 403850: 40.0, 404089: 40.0, 409843: 5.0}, 'B06003_008E': {409606: 106.0, 403811: 458.0, 400166: 296.0, 402573: 505.0, 400130: 277.0, 404907: 804.0, 406751: 1037.0, 403850: 726.0, 404089: 1854.0, 409843: 80.0}, 'B16006_003E': {409606: 30.0, 403811: 525.0, 400166: 160.0, 402573: 33.0, 400130: 386.0, 404907: 2.0, 406751: 55.0, 403850: 121.0, 404089: 686.0, 409843: 228.0}, 'C14007A_004E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C14007A_005E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C14007A_003E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C21001I_003E': {409606: 31.0, 403811: 287.0, 400166: 86.0, 402573: 25.0, 400130: 235.0, 404907: 35.0, 406751: 32.0, 403850: 73.0, 404089: 384.0, 409843: 84.0}, 'C21001I_006E': {409606: np.nan, 403811: 35.0, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: 13.0, 403850: 17.0, 404089: 19.0, 409843: 6.0}}
df = pd.DataFrame(data=d)
Here is a picture of my machine in htop showing its status while df.fillna(df.mean()) is running. It appears to be working, but memory usage is not fluctuating at all, so it may be frozen? It is hard to tell, and it is a waste of money to just let it keep going past 4 hours.
Is there a way to parallelize df.fillna(df.mean()) to make it run faster? To provide even more context, here is what I am currently trying, since up until now no one seems to have an answer:
from joblib import Parallel, delayed

def fill_nan(df, col):
    # Fill one column's NaNs with that column's mean, in place
    df[col].fillna(df[col].mean(), inplace=True)
    return df

col_list = all_data.columns.tolist()
l = Parallel(n_jobs=-1)(delayed(fill_nan)(df=all_data, col=col) for col in col_list)
The problem with this is that I am getting the following error:
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGSEGV(-11)}
Despite the error, would this method actually make the computation faster?
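One thing I noticed while debugging (a guess, not verified): each delayed(fill_nan)(df=all_data, ...) call has to serialize the entire all_data frame for its worker, and 10,500 such tasks over a 59 GB frame would plausibly blow up memory. Also, inplace=True inside a worker process only mutates that worker's copy, so results would have to be returned and reassembled anyway. A variant that ships only one column per task would look like this sketch (my own untested idea, not a confirmed fix):

from joblib import Parallel, delayed
import pandas as pd

def fill_col(s):
    # Operates on a single Series, so only one column is pickled per task
    return s.fillna(s.mean())

filled = Parallel(n_jobs=-1)(
    delayed(fill_col)(all_data[col]) for col in all_data.columns
)
all_data = pd.concat(filled, axis=1)

Even then, every filled column has to be pickled back and concatenated, so the overhead may dwarf the arithmetic, since the mean itself is cheap to compute.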
In my experience, applying .fillna() to all columns is more expensive than applying it selectively to only the columns that contain NaN values. To see this, observe the results of the following two functions:
def fill_nan1(df):
    # Fill every column, whether or not it contains NaNs
    col_list = df.columns.tolist()
    for col in col_list:
        df[col].fillna(df[col].mean(), inplace=True)
    return df

def fill_nan2(df):
    # Fill only the columns that actually contain NaNs
    for col in df.columns[df.isnull().any(axis=0)]:
        df[col].fillna(df[col].mean(), inplace=True)
    return df
.fillna() is applied to all columns of df in fill_nan1() (as in your case), while it is applied only to the columns containing NaN in fill_nan2(). Running %timeit on both gives:
>>> %timeit fill_nan1(df)
2.35 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit fill_nan2(df)
938 µs ± 8.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
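If looping over 10,500 columns is still too slow, one further option (my own addition, beyond what is measured above) is to do the whole fill in a single vectorized NumPy pass. This sketch assumes all columns are numeric and that a float copy of the frame fits in memory; fill_nan_numpy is an illustrative name:

import numpy as np
import pandas as pd

def fill_nan_numpy(df):
    # Replace every NaN with its column's mean in one vectorized pass
    arr = df.to_numpy(dtype=np.float64)   # note: this makes a copy
    col_means = np.nanmean(arr, axis=0)   # warns on all-NaN columns
    rows, cols = np.where(np.isnan(arr))
    arr[rows, cols] = col_means[cols]     # scatter each column's mean into its holes
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

Columns that are entirely NaN (like C14007A_004E in your sample) have a NaN mean and stay NaN, which matches what df.fillna(df.mean()) does.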
Also, if this is for ML purposes, split your data into train and test sets before filling your NaN values. That not only makes the computation faster but also protects you from wrong imputation: the test set must not influence the means you impute with.
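For example, with scikit-learn (a minimal sketch, assuming an 80/20 split; the variable names are illustrative):

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=0)

train_means = train.mean()          # fit the imputation on the train set only
train = train.fillna(train_means)
test = test.fillna(train_means)     # reuse train means so no information leaks from test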