Several questions have been asked and answered similar to mine, however they all used pd.concat
, whereas I would like to use merge
or something that could give me similar results.
I would like to combine multiple CSV files into one master df. All my CSV files have the same two columns ("gene" and "log2foldchange").
My CSV files all look something like the following:
gene | log2foldchange |
---|---|
gene1 | 0.03 |
gene2 | 0.02 |
gene3 | 0.01 |
gene4 | 0.05 |
Each CSV file:
I would like to incorporate the name of the CSV file as the column name that stores the log2foldchange value so that I know where the data came from. I would like my final master df to look something like the following:
left | name_of_1st_CSV_file | name_of_2nd_CSV_file | name_of_3rd_CSV_file | name_of_4th_CSV_file |
---|---|---|---|---|
gene1 | 0.08 | 0.09 | 0.07 | 0.01 |
gene2 | 0.07 | 0.03 | 0.06 | 0.001 |
gene3 | 0.08 | 0.06 | 0.05 | 0.2 |
gene4 | 0.09 | 0.02 | 0.03 | 0.011 |
I am able to successfully use pd.concat
to combine all my CSV files into one master df, however, the structure of my final df is such that all the CSV files were just stacked on top of one another. This structure doesn't make sense for the type of analysis I am trying to do. I would like to use merge
, but the issue that I am running into is that it requires two objects and I am not sure how to get around that.
Also I have only been able to get the file names as their own separate column, not as the column name for the log2fold change as I would like it.
This is the code that I have using pd.concat:
source_files = sorted(Path('path/to/my/files').glob('*.csv'))
dataframes = []
for file in source_files:
df = pd.read_csv(file)
df['source'] = file.name
dataframes.append(df)
df_all = pd.concat(dataframes)
display(df_all)
This results in a df that looks like this:
gene | log2foldchange | source |
---|---|---|
gene1 | 0.03 | 1st CSV file |
gene2 | 0.02 | 1st CSV file |
gene3 | 0.01 | 1st CSV file |
gene4 | 0.04 | 1st CSV file |
gene1 | 0.05 | 2nd CSV file |
gene2 | 0.06 | 2nd CSV file |
etc...
This is the code I have using pd.merge that results in an error because only one object is given:
path = 'path/to/my/files'
all_files = glob.glob(os.path.join(path, "*.csv")) #make list of file paths
#initialize empty data frame
li = []
for filename in all_files:
df = pd.read_csv(filename, index_col=None, header=0)
li.append(df)
frame = pd.merge(li, axis=0, ignore_index=True)
display(frame)
EDITED: Tweaked in response to comments
Would this work?
from functools import reduce
list_of_dfs = []
for file in all_files:
df = pd.read_csv(file, header=0, sep=",") #Change sep to whatever separates your data
df = df.loc[:, ['gene', 'log2foldchange']] #Subset out only the two columns you want
file_name = os.path.splitext(os.path.basename(file))[0] #Get filename without extension
df = df.rename(columns={'log2foldchange': file_name}) #Rename log2fold column with file name
list_of_dfs.append(df) #Add df to list of dfs
df_merged = reduce(lambda left,right: pd.merge(left,right,on='gene',how='outer'), list_of_dfs)