python python-3.x pandas dataframe default-value

Efficient way to fill default values for multiple columns in a large DataFrame


I have a DataFrame obs with 1.5 million records. I would like to fill its NA values with default values, as shown below.

import numpy as np
import pandas as pd

obs = pd.DataFrame({'person_id' :[1,2,3],'obs_date':['12/31/2007','11/25/2009',np.nan],
       'hero_id':[2,4,np.nan],'date2':['12/31/2017',np.nan,'10/06/2015'],
       'heroine_id':[1,np.nan,5],'date3':['12/31/2027','11/25/2029',np.nan],
       'bud_source_value':[1250000,250000,np.nan],
       'prod__source_value':[10000,20000,np.nan]})

The logic is to pick one of three default values based on the column name.

1) cols ending with id - fillna with 0

2) cols ending with value - fillna with '' (blank/empty)

3) cols containing date - fillna with 12/31/2000

Although my code below works fine (it is based on an SO suggestion), is there any way to speed it up?

%%timeit
c = obs.columns.str
c1 = c.endswith('id')
c2 = c.endswith('value')
c3 = c.contains('date')

obs_final = np.select([c1,c2,c3], [obs.fillna(0), obs.fillna(''), 
obs.fillna("12/31/2000")])
obs_final = pd.DataFrame(obs_final, columns=obs.columns)

It takes 19.5 s ± 303 ms per loop (mean ± std. dev. of 7 runs, 1 loop each). Is this normal?

Is there any way to improve it?


Solution

  • Select each group of columns with DataFrame.loc, using the boolean masks c1, c2 and c3 from the question, and call fillna on those columns only:

    obs.loc[:, c1] = obs.loc[:, c1].fillna(0)
    obs.loc[:, c2] = obs.loc[:, c2].fillna('')
    obs.loc[:, c3] = obs.loc[:, c3].fillna("12/31/2000")
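The same per-column idea can also be sketched without the .loc assignments, by passing a dict of column name to default value to DataFrame.fillna. This is an alternative technique, not the method from the answer above. The helper default_for and the dict defaults are hypothetical names introduced only for this illustration, and no timing has been measured for it on the 1.5 million row data.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
obs = pd.DataFrame({
    'person_id': [1, 2, 3],
    'obs_date': ['12/31/2007', '11/25/2009', np.nan],
    'hero_id': [2, 4, np.nan],
    'date2': ['12/31/2017', np.nan, '10/06/2015'],
    'heroine_id': [1, np.nan, 5],
    'date3': ['12/31/2027', '11/25/2029', np.nan],
    'bud_source_value': [1250000, 250000, np.nan],
    'prod__source_value': [10000, 20000, np.nan],
})


def default_for(col):
    """Hypothetical helper: return the default for a column name, or None.

    The rules are checked in the same order as the np.select call
    in the question: id, then value, then date.
    """
    if col.endswith('id'):
        return 0
    if col.endswith('value'):
        return ''
    if 'date' in col:
        return '12/31/2000'
    return None


# Map each matching column to its default; columns with no rule are left out.
defaults = {col: default_for(col) for col in obs.columns}
defaults = {col: val for col, val in defaults.items() if val is not None}

# fillna accepts a dict keyed by column name and fills each column
# with its own value, returning a new DataFrame.
obs_final = obs.fillna(value=defaults)
print(obs_final)
```

Note that filling a numeric column such as bud_source_value with '' turns it into an object column that mixes numbers and strings, which is also what the logic in the question implies.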