pandas, multiple-columns, parquet

Returning all the column names as lists from multiple Parquet Files in Python


I have more than 100 Parquet files in a folder, and I am not sure whether they all have the same feature names (column names). I want to write some Python code, using pandas, that reads every file in the directory and returns the column names with the file name as a prefix.

I tried a for loop, but I am not sure how to structure it. As a beginner, I could not get the loop to work.

import glob
import pandas as pd

path = r'C:\Users\NewFOlder1\NewFOlder\Folder'
all_files = glob.glob(path + r'\*.gzip')

col = []
for paths in all_files:
    df = pd.read_parquet(paths)
    col.append(df.columns)

print(col)

Solution

  • IIUC, you can use pandas.concat with pandas.DataFrame.columns to get the union of the column names across all files:

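    import glob
    import pandas as pd

    path = r'C:\Users\NewFOlder1\NewFOlder\Folder'
    all_files = glob.glob(path + r'\*.gzip')

    # Read every file into its own DataFrame
    list_dfs = []
    for paths in all_files:
        df = pd.read_parquet(paths)
        list_dfs.append(df)

    # concat does an outer join on columns by default,
    # so this is the union of all column names
    col_names = pd.concat(list_dfs).columns.tolist()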