I want to use pandas in python to loop through multiple DataFrames and keep only headings from a specified keep_col list. The code results in an error if a DataFrame does not contain a specified heading (KeyError: "['str2'] not in index").
The following pandas code creates 2 example DataFrames with differing column heading names:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(2,5), columns=('A','B','str1','str2','str3'))
df2 = pd.DataFrame(np.random.randn(2,3), columns=('A','B','str1'))
print df1
print df2
output DataFrames
A B str1 str2 str3
-0.152686 0.189076 -1.079168 -0.823674 1.489668
-1.272144 0.694862 0.036248 0.319550 0.782666
A B str1
0.310152 1.302962 -0.284632
1.046044 0.090650 0.861716
The code below results in an error because 'str2' is not in 'df2'.
How can this be modified to ignore a 'keep_col' list string if it is not in a DataFrame heading?
#delete columns
keep_col = ['A','str2'] #need code here to ignore 'str2' when generating 'df2'
new_df1 = df1[keep_col]
new_df2 = df2[keep_col]
print new_df1
print new_df2
This is the desired output:
A str2
-0.152686 -0.823674
-1.272144 0.319550
A
0.310152
1.046044
This example is for simplicity. I will be looping through 100+ .csv files to keep only specified columns.
you can use filter() function in conjunction with RegEx:
In [79]: mask = r'^(?:A|str2)$'
In [80]: df1.filter(regex=mask)
Out[80]:
A str2
0 -1.190226 -0.123637
1 -1.782685 0.219820
In [81]: df2.filter(regex=mask)
Out[81]:
A
0 0.207736
1 -0.013273