Search code examples
pythonpandasmergeconcatenation

Combine multiple CSV files using merge while retaining filename information


Several questions have been asked and answered similar to mine, however they all used pd.concat, whereas I would like to use merge or something that could give me similar results.

I would like to combine multiple CSV files into one master df. All my CSV files have the same two columns ("gene" and "log2foldchange").

My CSV files all look something like the following:

gene log2foldchange
gene1 0.03
gene2 0.02
gene3 0.01
gene4 0.05

Each CSV file:

  • has the exact same 4 genes and column names
  • has a different name (obviously)

I would like to incorporate the name of the CSV file as the column name that stores the log2foldchange value so that I know where the data came from. I would like my final master df to look something like the following:

left name_of_1st_CSV_file name_of_2nd_CSV_file name_of_3rd_CSV_file name_of_4th_CSV_file
gene1 0.08 0.09 0.07 0.01
gene2 0.07 0.03 0.06 0.001
gene3 0.08 0.06 0.05 0.2
gene4 0.09 0.02 0.03 0.011

I am able to successfully use pd.concat to combine all my CSV files into one master df, however, the structure of my final df is such that all the CSV files were just stacked on top of one another. This structure doesn't make sense for the type of analysis I am trying to do. I would like to use merge, but the issue that I am running into is that it requires two objects and I am not sure how to get around that.

Also I have only been able to get the file names as their own separate column, not as the column name for the log2fold change as I would like it.

This is the code that I have using pd.concat:

source_files = sorted(Path('path/to/my/files').glob('*.csv'))

dataframes = []
for file in source_files:
    df = pd.read_csv(file) 
    df['source'] = file.name
    dataframes.append(df)

df_all = pd.concat(dataframes)
display(df_all)

This results in a df that looks like this:

gene log2foldchange source
gene1 0.03 1st CSV file
gene2 0.02 1st CSV file
gene3 0.01 1st CSV file
gene4 0.04 1st CSV file
gene1 0.05 2nd CSV file
gene2 0.06 2nd CSV file

etc...

This is the code I have using pd.merge that results in an error because only one object is given:

path = 'path/to/my/files'
all_files = glob.glob(os.path.join(path, "*.csv")) #make list of file paths 

#initialize empty data frame 
li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.merge(li, axis=0, ignore_index=True)
display(frame)

Solution

  • EDITED: Tweaked in response to comments

    Would this work?

    from functools import reduce
    
    list_of_dfs = []
    
    for file in all_files:
        df = pd.read_csv(file, header=0, sep=",") #Change sep to whatever separates your data
        df = df.loc[:, ['gene', 'log2foldchange']] #Subset out only the two columns you want
        file_name = os.path.splitext(os.path.basename(file))[0] #Get filename without extension
        df = df.rename(columns={'log2foldchange': file_name}) #Rename log2fold column with file name
        list_of_dfs.append(df) #Add df to list of dfs
    
    df_merged = reduce(lambda left,right: pd.merge(left,right,on='gene',how='outer'), list_of_dfs)