I have a dataset with one data type spread across multiple columns. I'd like to reduce these to a single column. I have a function that accomplishes this, but it's a cumbersome process and I'm hoping there's a cleaner way. Here's a toy sample of my data:
UID COMPANY EML MAI TEL
273 7UP nan nan TEL
273 7UP nan MAI nan
906 WSJ nan nan TEL
906 WSJ EML nan nan
736 AIG nan MAI nan
What I'd like to get to:
UID COMPANY CONTACT_INFO
273 7UP MT
906 WSJ ET
736 AIG M
I've solved this by writing a function that converts EML, MAI, or TEL to a prime number, aggregates the results, and then converts the sum back into the constituent contact types (e.g. a UID with both EML and MAI sums to 7 + 101 = 108, which maps back to 'EM'). This works, and is reasonably quick. Here's a sample:
def columnRedux(df):
    newDF = df.copy()
    newDF.fillna('-', inplace=True)
    # Concatenate the three wide columns into one string per row
    newDF['CONTACT_INFO'] = newDF['EML'] + newDF['MAI'] + newDF['TEL']
    # Encode each contact type as a distinct prime so the group sums are unambiguous
    newDF.replace('EML--', 7, inplace=True)
    newDF.replace('-MAI-', 101, inplace=True)
    newDF.replace('--TEL', 1009, inplace=True)
    small = newDF.groupby(['UID', 'COMPANY'], as_index=False)['CONTACT_INFO'].sum()
    # Decode each possible sum back into its constituent contact types
    small.replace(7, 'E', inplace=True)
    small.replace(101, 'M', inplace=True)
    small.replace(108, 'EM', inplace=True)
    small.replace(1009, 'T', inplace=True)
    small.replace(1016, 'ET', inplace=True)
    small.replace(1110, 'MT', inplace=True)
    small.replace(1117, 'EMT', inplace=True)
    return small
df1 = pd.DataFrame(
    {'EML' : [np.nan, np.nan, np.nan, 'EML', np.nan, np.nan, 'EML', np.nan, np.nan, 'EML', 'EML', np.nan],
     'MAI' : [np.nan, 'MAI', np.nan, np.nan, 'MAI', np.nan, np.nan, np.nan, 'MAI', np.nan, np.nan, 'MAI'],
     'COMPANY' : ['7UP', '7UP', 'UPS', 'UPS', 'UPS', 'WSJ', 'WSJ', 'TJX', 'AIG', 'CDW', 'HEB', 'HEB'],
     'TEL' : ['TEL', np.nan, 'TEL', np.nan, np.nan, 'TEL', np.nan, 'TEL', np.nan, np.nan, np.nan, np.nan],
     'UID' : [273, 273, 865, 865, 865, 906, 906, 736, 316, 458, 531, 531]},
    columns=['UID', 'COMPANY', 'EML', 'MAI', 'TEL'])
cleanDF = columnRedux(df1)
My issue is that I have several data sets, each with its own set of "wide" columns; some have 5+ columns to be reduced, so hard-coding the conversions for every combination is not trivial. Is there a cleaner way to accomplish this?
Maybe not the "nicest" solution, but one option is a simple groupby with a lambda that filters which elements are included:
df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
       .apply(lambda x: ''.join(sorted([i[0] for y in x.values for i in y if pd.notnull(i)])))\
       .reset_index()\
       .rename(columns={0:'CONTACT_INFO'})
Alternatively, you could convert each grouped frame to str, replace the strings with single letters, and sum twice (down each column, then across the columns). Quite readable, I'd say.
m = {
    'nan': '',
    'EML': 'E',
    'MAI': 'M',
    'TEL': 'T'
}
df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
       .apply(lambda x: x.astype(str).replace(m).sum().sum())\
       .reset_index()\
       .rename(columns={0:'CONTACT_INFO'})
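Note that for datasets with 5+ wide columns you don't have to hand-write that mapping. A small sketch, assuming each column's code is simply its first letter (and that those first letters are unique across columns):

cols = ['EML', 'MAI', 'TEL']        # whatever the wide columns happen to be
m = {'nan': ''}
m.update({c: c[0] for c in cols})   # {'nan': '', 'EML': 'E', 'MAI': 'M', 'TEL': 'T'}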
Full example:
import io
import pandas as pd
data = '''\
UID COMPANY EML MAI TEL
273 7UP nan nan TEL
273 7UP nan MAI nan
906 WSJ nan nan TEL
906 WSJ EML nan nan
736 AIG nan MAI nan'''
fileobj = io.StringIO(data)
df = pd.read_csv(fileobj, sep=r'\s+')  # read_csv already parses the 'nan' tokens as NaN
# use a nested list comprehension to flatten the array and remove nans.
df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
       .apply(lambda x: ''.join(sorted([i[0] for y in x.values for i in y if pd.notnull(i)])))\
       .reset_index()\
       .rename(columns={0:'CONTACT_INFO'})
print(df)
Returns:
   UID COMPANY CONTACT_INFO
0  273     7UP           MT
1  736     AIG            M
2  906     WSJ           ET
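And since you have several datasets, each with its own set of wide columns, the first approach wraps up nicely into a reusable helper. A sketch, not a definitive implementation; columnRedux2, keys, and wide are hypothetical names, and it assumes every wide column holds either NaN or its own column name, with first letters unique across columns:

def columnRedux2(df, keys, wide):
    # Collapse the `wide` indicator columns into a single CONTACT_INFO column
    return (df.groupby(keys)[wide]
              .apply(lambda x: ''.join(sorted(
                  v[0] for row in x.values for v in row if pd.notnull(v))))
              .reset_index()
              .rename(columns={0: 'CONTACT_INFO'}))

cleanDF = columnRedux2(df1, ['UID', 'COMPANY'], ['EML', 'MAI', 'TEL'])

Five or fifty wide columns makes no difference here, since nothing about the combinations is hard-coded.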