I have multiple one column CSV files which I use pandas
to concatenate and find the mean:
#!/usr/bin/env python3
import os
import glob
import pandas as pd
from sys import argv, exit
os.chdir(f'{argv[1]}')
filenames = [i for i in glob.glob('*.csv')]
comb = pd.concat([pd.read_csv(f, header=None) for f in filenames], axis=1)
comb.columns = [f'Run {i+1}' for i in range(len(comb.columns))]
comb['Mean'] = comb.mean(numeric_only=True, axis=1)
comb.insert(loc=0, column='Epoch', value=[i+1 for i in range(len(comb))])
comb.to_csv(f'{argv[2]}', index=False)
Where argv[1]
is the directory to concatenate and argv[2]
is where to save the concatenated file. An example input might be:
0.6932
0.6605
0.634
0.5904
0.5339
And an example output:
Epoch Run 1 Run 2 Run 3 Run 4 Run 5 Mean
1 0.6932 0.6711 0.6687 0.6814 0.6903 0.684856667
2 0.6605 0.6326 0.6423 0.6323 0.6247 0.639423333
3 0.634 0.6003 0.6063 0.5928 0.5786 0.603956667
4 0.5904 0.5604 0.5576 0.5448 0.5276 0.559693333
5 0.5339 0.519 0.5068 0.496 0.4813 0.513653333
This is concatenating five one column, five row CSV files.
How can I drop columns from this table which have the highest number for the final row? Say I want to drop two columns, in this case Run 1
and Run 2
. A desired output might be:
Epoch Run 3 Run 4 Run 5 Mean
1 0.6687 0.6814 0.6903 0.680133333
2 0.6423 0.6323 0.6247 0.6331
3 0.6063 0.5928 0.5786 0.592566667
4 0.5576 0.5448 0.5276 0.543333333
5 0.5068 0.496 0.4813 0.4947
Does pandas
have a builtin to drop columns based on the value of some row? I'm interested in the final row for a given column and removing the columns with the highest value there.
This might be a little over-killed:
to_drop = (df.filter(like='Run') # choose only the `Run` columns
.iloc[-1] # and the last row
.nlargest(2) # two largest cells
.index # then the index, i.e. column names
)
df = df.drop(to_drop, axis=1)