Search code examples
pythonpandasmultiple-columnsheading

pandas to loop through DataFrames and keep only specified column headings. Error results if specified heading is not in DataFrame


I want to use pandas in python to loop through multiple DataFrames and keep only headings from a specified keep_col list. The code results in an error if a DataFrame does not contain a specified heading (KeyError: "['str2'] not in index").

The following pandas code creates 2 example DataFrames with differing column heading names:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(2,5), columns=('A','B','str1','str2','str3'))
df2 = pd.DataFrame(np.random.randn(2,3), columns=('A','B','str1'))
print df1
print df2

output DataFrames

 A         B         str1      str2      str3
-0.152686  0.189076 -1.079168 -0.823674  1.489668
-1.272144  0.694862  0.036248  0.319550  0.782666

 A         B         str1
 0.310152  1.302962 -0.284632
 1.046044  0.090650  0.861716

The code below results in an error because 'str2' is not in 'df2'.

How can this be modified to ignore a 'keep_col' list string if it is not in a DataFrame heading?

#delete columns
keep_col = ['A','str2'] #need code here to ignore 'str2' when generating 'df2'
new_df1 = df1[keep_col] 
new_df2 = df2[keep_col]

print new_df1
print new_df2

This is the desired output:

 A          str2    
-0.152686  -0.823674
-1.272144   0.319550

 A       
 0.310152  
 1.046044  

This example is for simplicity. I will be looping through 100+ .csv files to keep only specified columns.


Solution

  • you can use filter() function in conjunction with RegEx:

    In [79]: mask = r'^(?:A|str2)$'
    
    In [80]: df1.filter(regex=mask)
    Out[80]:
              A      str2
    0 -1.190226 -0.123637
    1 -1.782685  0.219820
    
    In [81]: df2.filter(regex=mask)
    Out[81]:
              A
    0  0.207736
    1 -0.013273