I have many dataframes in pickle format, they have the same information, but columns name are not identical in all of them. For example:
> file name: Columns_['Weryfikacja UK - 2022-04-14.xlsx_20uk-woocommerce-last-5yrs-expo.pkl'].pkl
> columns_nam: Index(['domain', 'phones', 'phones_data_source', 'company_name', 'SourceFile'], dtype='object')
>
> file name: Columns_['US _ Canada - rechurn - 01.2022-04.2022.xlsx_Other.pkl'].pkl
> columns_nam: Index(['Phone', 'Domain', 'SourceFile'], dtype='object')
>
> file name: Columns_['2022-08 - US _ Canada.xlsx_29.08-02.09 WixStore USCA.pkl'].pkl
> columns_nam: Index(['Phone', 'Alternative phone 1', 'Alternative phone2', 'Alternative phone3', 'Alternative phone4', 'SourceFile'], dtype='object')
I have a dict like this to rename columns names in all files:
my_dict = {
"Domain": ['Domain','domain', 'WWW', 'www'],
"Phone": ['Phone','phone_number', 'phones', 'Tel'],
"AlternativePhone1": ['Alternative phone1','Alternative phone 2', 'phones2'],
"AlternativePhone2": ['Alternative phone2','Alternative phone 3', 'phones3'],
"AlternativePhone3": ['Alternative phone3','Alternative phone 4', 'phones4'],
"AlternativePhone4": ['Alternative phone4','Alternative phone 5', 'phones5'],
"SourceFile": ['SourceFile']
}
I need a help with code, how should I do it?
for file in glob.glob("*.pkl"):
df = pd.read_pickle(file)
On output I would like to have something like this
> file name: Columns_['2022-08 - US _ Canada.xlsx_29.08-02.09 WixStore USCA.pkl'].pkl
> columns_nam: Index(['Phone', 'AlternativePhone1', 'AlternativePhone2', 'AlternativePhone3', 'AlternativePhone4', 'SourceFile'], dtype='object')
I don't think there is a standard way to rename columns based on a list of possible names built-in to pandas, but its easy enough with a few for loops:
my_dict = {
"Domain": ['Domain','domain', 'WWW', 'www'],
"Phone": ['Phone','phone_number', 'phones', 'Tel'],
}
df = pd.DataFrame({
"WWW": [1,2,3,4,5],
"phones": [1,2,3,4,5]
})
for (goal, names) in my_dict.items():
for name in names:
if name in df.columns:
df.rename(columns={name:goal}, inplace=True)
The pandas.rename
function does not actually care if the dataframe contains all the columns you provide new names for in the columns
dictonary. This means that if we convert my_dict
from newname -> List[oldname]
to the oldname -> newname
fromat expected by columns
we can rename the columns without any for loops.
columns_dict = dict(sum([[(old, new) for old in olds] for (new, olds) in my_dict.items()], []))
# columns_dict = {
# 'Domain': 'Domain', 'domain': 'Domain', 'WWW': 'Domain', 'www': 'Domain',
# 'Phone': 'Phone', 'phone_number': 'Phone', 'phones': 'Phone', 'Tel': 'Phone'
# }
df.rename(columns=columns_dict, inplace=True)