Search code examples
pythonpandasdataframeglobfillna

How to fillna efficiently during glob operation of multiple csv files


I have more than 15 csv files. The csv file 1 to file 4 when loaded as a dataframe look like as shown below

df = pd.DataFrame({'person_id': [101,101,101,101],
                        'Date':['06/08/2014 08:00:00 AM','06/01/2014 05:00:00 AM','06/03/2014 08:00:00 AM','06/09/2014 05:00:00 AM'],
                        'Test':['A1','B2','B5','B6']})

The csv file 5 to file 15 look like as shown below

df1 = pd.DataFrame({'person_id': [201,201,201,201],
                        'Date':[np.nan,np.nan,'06/03/2014 08:00:00 AM','06/09/2014 05:00:00 AM'],
                        'Test':['A1','B2','B5','B6'],
                        'date_backup':['03/01/2014 08:00:00 AM','10/08/2014 08:00:00 AM','09/09/2014 08:00:00 AM',np.nan]})

As you can see from sample dataframe df1 (of file 5 to file 15), that I have two date columns

Now from file 5 to file 15, I would like to fillna in those date_backup or Date by copying value from the other.

For example, I would like to do the below

df1.Date.fillna(df.date_backup, inplace=True)
df1.date_backup.fillna(df.Date, inplace=True)

Now, am not sure how can I do the above fillna() operation in the below code

I tried the below but not sure whether it is efficient and elegant

pat_dir = ['Path1\path2\La*.csv','Path3\Path4\20*.csv']
files_grabbed = []
for files in pat_dir:
    files_grabbed.extend(glob.glob(files))
d = {'date_backup':'Date'}
cols = ['Date','Test','value','person_id']
dfs = [pd.read_csv(f, sep=",",low_memory=False).rename(columns=d).reindex(columns=cols) 
       for f in files_grabbed]   #thanks to Jezrael for this code

In the code above, you can see that I am renaming the date_backup column.

How can I fillna() efficiently and elegantly

I expect my output to be concatenated/appended dataframe of 15 csv files like as shown below (with one Date column). My only question is how can I fillna() before appending dataframes, so I will have only Date column

enter image description here


Solution

  • If need fillna columns in one DataFrame is possible use if statement for fillna if exist column date_backup:

    dfs= []
    for f in files_grabbed:
        df1 = pd.read_csv(f, sep=",",low_memory=False)
        if 'date_backup' in df1.columns:
            df1.Date = df1.Date.fillna(df1.date_backup)
          
        #not sure if necessary rename and reindex
        df1 = df1.rename(columns=d).reindex(columns=cols) 
        dfs.append(df1)
    

    Another idea is add date_backup for each DataFrame and fillna after concat:

    cols = ['Date','Test','value','person_id', 'date_backup']
    dfs = [pd.read_csv(f, sep=",",low_memory=False).rename(columns=d).reindex(columns=cols) 
       for f in files_grabbed] 
    
    
    df = pd.concat(dfs)
    df.Date = df.Date.fillna(df.pop('date_backup'))