I have multiple csv files that were produced by tokenizing code. These files contain keywords in uppercase and lowercase. I would like to merge all those files in one single dataframe which contains all the unique values (summed) in lowercase. What would you suggest to get the result below?
Initial DF:
+---+---+----+-----+
| a | b | A | B |
+---+---+----+-----+
| 1 | 2 | 3 | 1 |
| 2 | 1 | 3 | 1 |
+---+---+----+-----+
Result
+---+---+
| a | b |
+---+---+
| 4 | 3 |
| 5 | 2 |
+---+---+
I don't have access to the raw data from which the csv files where created so I cannot correct this at an earlier step. At the moment I have tried mapping .lower() to the dataframe headers that I create, but it returns seperate columns with the same name like so:
Using pandas is not essential. I have thought of converting the csv files to dictionaries and then trying the above procedure (turns out it is much more complicated than I thought), or using lists. Also, group by does not do the job as it will remove non duplicate column names. Any approach is welcome.
You could iterate through the columns summing those that have the same lowercase representation:
def sumDupeColumns(df):
"""Return dataframe with columns with the same lowercase spelling summed."""
# Get list of unique lowercase column headers
columns = set(map(str.lower, df.columns))
# Create new (zero-initialised) dataframe for output
df1 = pd.DataFrame(data=np.zeros((len(df), len(columns))), columns=columns)
# Sum matching columns
for col in df.columns:
df1[col.lower()] += df[col]
return df1
import pandas as pd
import numpy as np
np.random.seed(seed=42)
# Generate DataFrame with random int input and 'duplicate' columns to sum
df = pd.DataFrame(columns = ['a','A','b','B','Cc','cC','d','eEe','eeE','Eee'],
data = np.random.randint(9, size=(5,10))
df = sumDupeColumns(df)
>>> print(df)
d eee cc a b
0 6.0 14.0 8.0 9.0 11.0
1 7.0 10.0 5.0 14.0 7.0
2 3.0 14.0 8.0 5.0 8.0
3 3.0 17.0 7.0 8.0 12.0
4 0.0 11.0 9.0 5.0 9.0