I have 3 CSV files whose contents look like these sample DataFrames:
df = pd.DataFrame({'person_id': [11,11,11],'Age':[23,25,28],'Summary':['Test','Test1','Test2']})
df1 = pd.DataFrame({'person_id': [21,22,51],'Age':[26,29,22],'Order Summary':['Tep','Tst1','Tt2'],'Order Summary2':['ppp','Ttt','Tfsa']})
df2 = pd.DataFrame({'person_id': [31,31,41],'Age':[27,20,21],'Order Summary':['Tet','Tt1','Tt2'],'Order Summary1':['Tet','Tt1','Tt2']})
I would like to read all those CSV files and create one DataFrame. However, I only need two columns from each file, and their names differ slightly between files. The final DataFrame should have two columns: person_id and Summary (also named Order Summary in the other CSV files). I don't wish to read the Age, Order Summary1, or Order Summary2 columns. Basically, I want to use regex/pattern matching on Summary|Order Summary to select columns while creating the final DataFrame.
I was trying something like the below, based on an SO post:
import glob
import pandas as pd

col_list = ["person_id", "Summary"]  # but here I don't know how to use regex
files = glob.glob("file*.csv")
dfs = [pd.read_csv(f, usecols=col_list, sep=";") for f in files]
meddata = pd.concat(dfs,ignore_index=True)
Can you help me with how to use regex to select columns while reading the CSVs?
I expect my final DataFrame to have columns as shown below (you can see how only the 2 required columns from each CSV file are selected and concatenated):
person_id Summary
11 Test
11 Test1
11 Test2
21 Tep
22 Tst1
51 Tt2
31 Tet
31 Tt1
41 Tt2
Simpler is to rename the Order Summary column first and then select only the two expected columns:
dfs = [pd.read_csv(f, sep=";").rename(columns={'Order Summary':'Summary'})[['person_id','Summary']]
for f in files]
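The rename-first approach can be run end to end; here is a minimal sketch, assuming the three sample frames are saved as file1.csv through file3.csv with ";" separators (the file names are hypothetical):

```python
import glob
import pandas as pd

# Recreate the three sample CSVs (hypothetical names file1.csv..file3.csv).
samples = [
    {'person_id': [11, 11, 11], 'Age': [23, 25, 28],
     'Summary': ['Test', 'Test1', 'Test2']},
    {'person_id': [21, 22, 51], 'Age': [26, 29, 22],
     'Order Summary': ['Tep', 'Tst1', 'Tt2'], 'Order Summary2': ['ppp', 'Ttt', 'Tfsa']},
    {'person_id': [31, 31, 41], 'Age': [27, 20, 21],
     'Order Summary': ['Tet', 'Tt1', 'Tt2'], 'Order Summary1': ['Tet', 'Tt1', 'Tt2']},
]
for i, data in enumerate(samples, 1):
    pd.DataFrame(data).to_csv(f'file{i}.csv', sep=';', index=False)

files = sorted(glob.glob('file*.csv'))
# Rename 'Order Summary' to 'Summary' (a no-op where it is absent),
# then keep only the two wanted columns.
dfs = [pd.read_csv(f, sep=';')
         .rename(columns={'Order Summary': 'Summary'})[['person_id', 'Summary']]
       for f in files]
meddata = pd.concat(dfs, ignore_index=True)
print(meddata)
```

Sorting the glob result keeps the row order deterministic across runs.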
Old answer:
Use DataFrame.filter with a regex that matches person_id or any column name ending with Summary:
print (df.filter(regex='person_id|Summary$'))
person_id Summary
0 11 Test
1 11 Test1
2 11 Test2
print (df1.filter(regex='person_id|Summary$'))
person_id Order Summary
0 21 Tep
1 22 Tst1
2 51 Tt2
print (df2.filter(regex='person_id|Summary$'))
person_id Order Summary
0 31 Tet
1 31 Tt1
2 41 Tt2
Another idea is to use Index.intersection with a list of the possible column names:
print (df[df.columns.intersection(['person_id', 'Summary', 'Order Summary'])])
print (df1[df1.columns.intersection(['person_id', 'Summary', 'Order Summary'])])
print (df2[df2.columns.intersection(['person_id', 'Summary', 'Order Summary'])])
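For example, on a frame like df1 (a self-contained sketch): the intersection silently skips names that are absent from the frame, so asking for both Summary and Order Summary raises no KeyError.

```python
import pandas as pd

df1 = pd.DataFrame({'person_id': [21, 22, 51], 'Age': [26, 29, 22],
                    'Order Summary': ['Tep', 'Tst1', 'Tt2'],
                    'Order Summary2': ['ppp', 'Ttt', 'Tfsa']})

# Only columns present in both the frame and the wanted list survive;
# 'Summary' does not exist here, so it is simply dropped from the result.
keep = df1.columns.intersection(['person_id', 'Summary', 'Order Summary'])
out = df1[keep].rename(columns={'Order Summary': 'Summary'})
print(out)
```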
So in your solution, also add rename so the output DataFrame has the 2 expected columns:
dfs = [pd.read_csv(f, sep=";").filter(regex='person_id|Summary$').rename(columns={'Order Summary':'Summary'})
for f in files]
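If you want the regex applied while each file is being read, so that Age and the extra Summary columns are never parsed at all, pd.read_csv also accepts a callable for usecols; here is a sketch under the same hypothetical file1.csv..file3.csv names:

```python
import glob
import re
import pandas as pd

# Recreate the three sample CSVs (hypothetical names file1.csv..file3.csv).
samples = [
    {'person_id': [11, 11, 11], 'Age': [23, 25, 28],
     'Summary': ['Test', 'Test1', 'Test2']},
    {'person_id': [21, 22, 51], 'Age': [26, 29, 22],
     'Order Summary': ['Tep', 'Tst1', 'Tt2'], 'Order Summary2': ['ppp', 'Ttt', 'Tfsa']},
    {'person_id': [31, 31, 41], 'Age': [27, 20, 21],
     'Order Summary': ['Tet', 'Tt1', 'Tt2'], 'Order Summary1': ['Tet', 'Tt1', 'Tt2']},
]
for i, data in enumerate(samples, 1):
    pd.DataFrame(data).to_csv(f'file{i}.csv', sep=';', index=False)

# usecols may be a callable: pandas calls it once per column name and keeps
# the columns for which it returns True, so unwanted columns are skipped
# during parsing rather than dropped afterwards.
pattern = re.compile(r'person_id|Summary$')
dfs = [pd.read_csv(f, sep=';', usecols=lambda c: bool(pattern.search(c)))
         .rename(columns={'Order Summary': 'Summary'})
       for f in sorted(glob.glob('file*.csv'))]
meddata = pd.concat(dfs, ignore_index=True)
print(meddata)
```

The $ anchor is what excludes Order Summary1 and Order Summary2, since those names do not end with Summary.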
For the second approach:
dfs = []
for f in files:
    df = pd.read_csv(f, sep=";")
    df1 = df[df.columns.intersection(['person_id', 'Summary', 'Order Summary'])].rename(columns={'Order Summary': 'Summary'})
    dfs.append(df1)
meddata = pd.concat(dfs, ignore_index=True)