
Pandas read csv using wild card pattern for selecting columns


I have 3 csv files (shown here as DataFrames):

df = pd.DataFrame({'person_id': [11,11,11],'Age':[23,25,28],'Summary':['Test','Test1','Test2']})
df1 = pd.DataFrame({'person_id': [21,22,51],'Age':[26,29,22],'Order Summary':['Tep','Tst1','Tt2'],'Order Summary2':['ppp','Ttt','Tfsa']})
df2 = pd.DataFrame({'person_id': [31,31,41],'Age':[27,20,21],'Order Summary':['Tet','Tt1','Tt2'],'Order Summary1':['Tet','Tt1','Tt2']})

I would like to read all those csv files and create one dataframe.

However, I only need two columns from each file, and their names differ slightly. I would like to create one final dataframe with two columns, person_id and Summary (the latter is named Order Summary in the other csv files).

I don't wish to read the Age, Order Summary1 or Order Summary2 columns from the other csv files.

Basically, use regex/pattern matching to read only the Summary|Order Summary column while creating the final dataframe.

I was trying something like the below, based on an SO post:

col_list = ["person_id", "Summary"]  # but here i don't know how to use regex
files = glob.glob("file*.csv")
dfs = [pd.read_csv(f,usecols = col_list, header=None, sep=";") for f in files]

meddata = pd.concat(dfs,ignore_index=True)

Can you help me with how to use regex to select columns while reading the csvs?

I expect my final dataframe to have the columns shown below (you can see how only the 2 required columns from each csv file are selected and concatenated):

person_id   Summary
11          Test
11          Test1
11          Test2
21          Tep
22          Tst1
51          Tt2
31          Tet
31          Tt1
41          Tt2

Solution

  • Simpler is to rename the Order Summary column first and then select only the 2 expected columns:

    dfs = [pd.read_csv(f, sep=";").rename(columns={'Order Summary':'Summary'})[['person_id','Summary']]
           for f in files] 
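Run against in-memory copies of the question's first two frames, the rename-then-select pipeline behaves like this (a sketch; rename silently ignores frames that have no Order Summary column, so one expression covers all files):

```python
import pandas as pd

# In-memory stand-ins for two of the csv files from the question
df = pd.DataFrame({'person_id': [11, 11, 11], 'Age': [23, 25, 28],
                   'Summary': ['Test', 'Test1', 'Test2']})
df1 = pd.DataFrame({'person_id': [21, 22, 51], 'Age': [26, 29, 22],
                    'Order Summary': ['Tep', 'Tst1', 'Tt2'],
                    'Order Summary2': ['ppp', 'Ttt', 'Tfsa']})

# rename is a no-op where 'Order Summary' does not exist, so the same
# pipeline works for every frame
frames = [d.rename(columns={'Order Summary': 'Summary'})[['person_id', 'Summary']]
          for d in (df, df1)]
meddata = pd.concat(frames, ignore_index=True)
print(meddata['Summary'].tolist())  # ['Test', 'Test1', 'Test2', 'Tep', 'Tst1', 'Tt2']
```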
    

    Old answer:

    Use DataFrame.filter with a regex that matches person_id or any column name ending in Summary:

    print (df.filter(regex='person_id|Summary$'))
       person_id Summary
    0         11    Test
    1         11   Test1
    2         11   Test2
    
    print (df1.filter(regex='person_id|Summary$'))
       person_id Order Summary
    0         21           Tep
    1         22          Tst1
    2         51           Tt2
    
    print (df2.filter(regex='person_id|Summary$'))
       person_id Order Summary
    0         31           Tet
    1         31           Tt1
    2         41           Tt2
    

    Another idea is to use Index.intersection with a list of the possible column names:

    print (df[df.columns.intersection(['person_id', 'Summary', 'Order Summary'])])
    print (df1[df1.columns.intersection(['person_id', 'Summary', 'Order Summary'])])
    print (df2[df2.columns.intersection(['person_id', 'Summary', 'Order Summary'])])
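One detail worth showing (a sketch on an in-memory copy of df1): Index.intersection keeps only the names that actually exist in that frame, so the same list works whether the file has Summary or Order Summary:

```python
import pandas as pd

df1 = pd.DataFrame({'person_id': [21, 22, 51], 'Age': [26, 29, 22],
                    'Order Summary': ['Tep', 'Tst1', 'Tt2'],
                    'Order Summary2': ['ppp', 'Ttt', 'Tfsa']})

wanted = ['person_id', 'Summary', 'Order Summary']

# 'Summary' is simply dropped from the list here because df1 does not
# have it; only the names present in this particular frame survive
out = df1[df1.columns.intersection(wanted)]
print(sorted(out.columns))  # ['Order Summary', 'person_id']
```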

    So in your solution, also add a rename so the output DataFrame has the 2 expected columns:

    dfs = [pd.read_csv(f, sep=";").filter(regex='person_id|Summary$').rename(columns={'Order Summary':'Summary'})
           for f in files] 
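Applied to in-memory versions of all three frames (a sketch standing in for the glob/read_csv loop), the filter-plus-rename pipeline yields the expected 9-row result:

```python
import pandas as pd

df = pd.DataFrame({'person_id': [11, 11, 11], 'Age': [23, 25, 28],
                   'Summary': ['Test', 'Test1', 'Test2']})
df1 = pd.DataFrame({'person_id': [21, 22, 51], 'Age': [26, 29, 22],
                    'Order Summary': ['Tep', 'Tst1', 'Tt2'],
                    'Order Summary2': ['ppp', 'Ttt', 'Tfsa']})
df2 = pd.DataFrame({'person_id': [31, 31, 41], 'Age': [27, 20, 21],
                    'Order Summary': ['Tet', 'Tt1', 'Tt2'],
                    'Order Summary1': ['Tet', 'Tt1', 'Tt2']})

# '$' anchors the match at the end of the name, so Order Summary1 and
# Order Summary2 are excluded
frames = [d.filter(regex='person_id|Summary$')
           .rename(columns={'Order Summary': 'Summary'})
          for d in (df, df1, df2)]
meddata = pd.concat(frames, ignore_index=True)
print(len(meddata), list(meddata.columns))  # 9 ['person_id', 'Summary']
```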
    

    For the second approach:

    dfs = []
    for f in files:
        df = pd.read_csv(f, sep=";")
        df1 = df[df.columns.intersection(['person_id', 'Summary', 'Order Summary'])].rename(columns={'Order Summary': 'Summary'})
        dfs.append(df1)

    meddata = pd.concat(dfs, ignore_index=True)
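Finally, if the columns really must be filtered while reading, as the question asks: read_csv's usecols parameter also accepts a callable that is evaluated against each raw column name, so the regex can run at parse time. A sketch, using an in-memory CSV in place of a file on disk:

```python
import io
import re
import pandas as pd

# Stand-in for one of the csv files; on disk this would be
# pd.read_csv('file2.csv', sep=';', usecols=...)
csv_text = "person_id;Age;Order Summary;Order Summary2\n21;26;Tep;ppp\n"

keep = re.compile(r'person_id|Summary$')  # '$' excludes Order Summary1/2

df = pd.read_csv(io.StringIO(csv_text), sep=';',
                 usecols=lambda c: bool(keep.search(c)))
print(list(df.columns))  # ['person_id', 'Order Summary']
```

A rename to Summary is still needed afterwards, as in the solutions above.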