
Pandas find common NA records across multiple large dataframes


I have 3 dataframes, as shown below:

ID,col1,col2
1,X,35
2,X,37
3,nan,32
4,nan,34
5,X,21
df1 = pd.read_clipboard(sep=',',skipinitialspace=True)

ID,col1,col2
1,nan,305
2,X,307
3,X,302
4,nan,304
5,X,201
df2 = pd.read_clipboard(sep=',',skipinitialspace=True)

ID,col1,col2
1,X,315
2,nan,317
3,X,312
4,nan,314
5,X,21
df3 = pd.read_clipboard(sep=',',skipinitialspace=True)

Now I want to identify the IDs where col1 is NA in all 3 input dataframes.

So, I tried the below

# NA IDs per dataframe
L1 = df1[df1['col1'].isna()]['ID'].tolist()
L2 = df2[df2['col1'].isna()]['ID'].tolist()
L3 = df3[df3['col1'].isna()]['ID'].tolist()
# IDs whose col1 is NA in every dataframe
common_ids_all = list(set.intersection(*map(set, [L1, L2, L3])))
final_df = pd.concat([df1, df2, df3], ignore_index=True)
final_df[final_df['ID'].isin(common_ids_all)]

While the above works, is there a more efficient and elegant approach to do this?

As you can see, I am repeating the same statement three times (once per dataframe).

However, in my real data, I have 12 dataframes where I have to get IDs where col1 is NA in all 12 dataframes.
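
A hedged sketch of how the same set-intersection idea scales to any number of dataframes, so nothing needs repeating per frame (the three sample dataframes are rebuilt inline here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

# Rebuild the three sample dataframes from the question.
df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'col1': ['X', 'X', np.nan, np.nan, 'X'],
                    'col2': [35, 37, 32, 34, 21]})
df2 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'col1': [np.nan, 'X', 'X', np.nan, 'X'],
                    'col2': [305, 307, 302, 304, 201]})
df3 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'col1': ['X', np.nan, 'X', np.nan, 'X'],
                    'col2': [315, 317, 312, 314, 21]})

dfs = [df1, df2, df3]  # with the real data this list would hold all 12 dataframes

# One set of NA IDs per dataframe, intersected in a single call.
na_ids = [set(df.loc[df['col1'].isna(), 'ID']) for df in dfs]
common_ids_all = set.intersection(*na_ids)

final_df = pd.concat(dfs, ignore_index=True)
result = final_df[final_df['ID'].isin(common_ids_all)]
```

Adding a dataframe then only means appending it to `dfs`; the rest of the code is unchanged.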

Update: my current read operation looks like the below.

fnames = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx']
dfs = []
NA_list = []

def preprocessing(fname):
    df = pd.read_excel(fname, sheet_name="Sheet1")
    df.columns = df.iloc[7]   # row 8 of the sheet holds the real column names
    df = df.iloc[8:, :]       # keep only the rows after that header row
    NA_list.append(df[df['col1'].isna()]['ID'])
    dfs.append(df)

[preprocessing(fname) for fname in fnames]
final_df = pd.concat(dfs, ignore_index=True)
L1 = NA_list[0]
L2 = NA_list[1]
L3 = NA_list[2]
final_list = list(set.intersection(*map(set, [L1, L2, L3])))
final_df[final_df['ID'].isin(final_list)]
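
The read step above can also drop the global-list side effects by having `preprocessing` return its dataframe. A sketch of that shape, using in-memory CSV text as a stand-in for the real Excel files (the filenames, sheet layout, and data here are placeholders, not the actual files):

```python
import io
import pandas as pd

# In-memory CSVs standing in for file1.xlsx, file2.xlsx, file3.xlsx.
sources = {
    'file1': 'ID,col1,col2\n1,X,35\n2,X,37\n3,,32\n4,,34\n5,X,21',
    'file2': 'ID,col1,col2\n1,,305\n2,X,307\n3,X,302\n4,,304\n5,X,201',
    'file3': 'ID,col1,col2\n1,X,315\n2,,317\n3,X,312\n4,,314\n5,X,21',
}

def preprocessing(name):
    # With the real files this would be something like:
    #   return pd.read_excel(name, sheet_name='Sheet1', header=8)
    return pd.read_csv(io.StringIO(sources[name]))

dfs = [preprocessing(name) for name in sources]  # no global mutation
na_id_sets = [set(df.loc[df['col1'].isna(), 'ID']) for df in dfs]
final_list = sorted(set.intersection(*na_id_sets))
final_df = pd.concat(dfs, ignore_index=True)
print(final_df[final_df['ID'].isin(final_list)])
```

Returning values instead of appending to module-level lists keeps `preprocessing` reusable and makes the data flow explicit.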

Solution

  • You can use:

    dfs = [df1, df2, df3]
    final_df = pd.concat(dfs).query('col1.isna()', engine='python')
    final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
    print(final_df)
    
    # Output
       ID col1  col2
    3   4  NaN    34
    3   4  NaN   304
    3   4  NaN   314
    

    Full code:

    fnames = ['file1.xlsx','file2.xlsx', 'file3.xlsx']
    
    def preprocessing(fname):
        # header=8 reproduces df.columns = df.iloc[7]; df = df.iloc[8:, :] from the question
        return pd.read_excel(fname, sheet_name='Sheet1', header=8)
    
    dfs = [preprocessing(fname) for fname in fnames]
    final_df = pd.concat([df[df['col1'].isna()] for df in dfs])
    final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
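
To see why the size check works: after filtering to NA rows, an ID appears exactly once per dataframe in which it was NA (assuming IDs are unique within each file), so a group size equal to `len(dfs)` means the ID is NA everywhere. A quick check on the sample data:

```python
import numpy as np
import pandas as pd

# The three sample dataframes from the question.
df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'col1': ['X', 'X', np.nan, np.nan, 'X'],
                    'col2': [35, 37, 32, 34, 21]})
df2 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'col1': [np.nan, 'X', 'X', np.nan, 'X'],
                    'col2': [305, 307, 302, 304, 201]})
df3 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'col1': ['X', np.nan, 'X', np.nan, 'X'],
                    'col2': [315, 317, 312, 314, 21]})
dfs = [df1, df2, df3]

concatenated = pd.concat(dfs)
na_rows = concatenated[concatenated['col1'].isna()]

# IDs 1, 2 and 3 each appear once among the NA rows; ID 4 appears three times,
# so only ID 4 survives the size == len(dfs) filter.
sizes = na_rows.groupby('ID')['ID'].transform('size')
final_df = na_rows[sizes == len(dfs)]
```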