Search code examples
pythonpandasdataframereduction

Collapse 3 Text Colums into 1 within a wide Pandas DataFrame


I have a dataset with one data type spread across multiple columns. I'd like to reduce these to a single column. I have a function that accomplishes this, but its a cumbersome process and I'm hoping there's a cleaner way to accomplish this. Here's a toy sample of my data:

UID    COMPANY    EML    MAI   TEL
273    7UP        nan    nan   TEL
273    7UP        nan    MAI   nan
906    WSJ        nan    nan   TEL
906    WSJ        EML    nan   nan
736    AIG        nan    MAI   nan

What I'd like to get to:

UID    COMPANY   CONTACT_INFO
273    7UP       MT
906    WSJ       ET
736    AIG       M

I've solved this by writing a function that converts EML, MAI or TEL to a prime number, aggregates the results then converts the sum into the constituent contact types. This works, and is reasonably quick. Here's a sample:

def columnRedux(df):
    newDF = df.copy()
    newDF.fillna('-', inplace=True)
    newDF['CONTACT_INFO'] = newDF['EML'] + newDF['MAI'] + newDF['TEL']
    newDF.replace('EML--', 7, inplace=True)
    newDF.replace('-MAI-', 101, inplace=True)
    newDF.replace('--TEL', 1009, inplace=True)

    small = newDF.groupby(['UID', 'COMPANY'], as_index=False)['CONTACT_INFO'].sum()

    small.replace(7, 'E', inplace=True)
    small.replace(101, 'M', inplace=True)
    small.replace(108, 'EM', inplace=True)
    small.replace(1009, 'T', inplace=True)
    small.replace(1016, 'ET', inplace=True)
    small.replace(1110, 'MT', inplace=True)
    small.replace(1117, 'EMT', inplace=True)

    return small

df1 = pd.DataFrame(
    {'EML' : [np.nan, np.nan, np.nan, 'EML', np.nan, np.nan, 'EML', np.nan, np.nan, 'EML', 'EML', np.nan],
    'MAI' : [np.nan, 'MAI', np.nan, np.nan, 'MAI', np.nan, np.nan, np.nan, 'MAI', np.nan, np.nan, 'MAI'],
    'COMPANY' : ['7UP', '7UP', 'UPS', 'UPS', 'UPS', 'WSJ', 'WSJ', 'TJX', 'AIG', 'CDW', 'HEB', 'HEB'],
    'TEL' : ['TEL', np.nan, 'TEL', np.nan, np.nan, 'TEL', np.nan, 'TEL', np.nan, np.nan, np.nan, np.nan],
    'UID' : [273, 273, 865, 865, 865, 906, 906, 736, 316, 458, 531, 531]},
    columns=['UID', 'COMPANY', 'EML', 'MAI', 'TEL'])

cleanDF = columnRedux(df1)

My issue is that I have several data sets, each with its own set of "wide" columns. Some have 5+ columns to be reduced. Hard coding the conversions for all of the variations is not trivial. Is there a cleaner way to accomplish this?


Solution

  • Maybe not the "nicest" solution. But one would be to use a simple groupby and condition the included elements:

    df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
        .apply(lambda x: ''.join(sorted([i[0] for y in x.values for i in y if pd.notnull(i)])))\
        .reset_index()\
        .rename(columns={0:'CONTACT_INFO'})
    

    Or an alternative would be to convert the grouped dataframes to type str and replace the strings and sum. Quite readable I'd say.

    m = {
        'nan':'',
        'EML':'E',
        'MAI':'M',
        'TEL':'T'
    }
    
    df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
           .apply(lambda x: x.astype(str).replace(m).sum().sum())\
           .reset_index()\
           .rename(columns={0:'CONTACT_INFO'})
    

    Full example:

    import pandas as pd
    import numpy as np
    
    data = '''\
    UID    COMPANY    EML    MAI   TEL
    273    7UP        nan    nan   TEL
    273    7UP        nan    MAI   nan
    906    WSJ        nan    nan   TEL
    906    WSJ        EML    nan   nan
    736    AIG        nan    MAI   nan'''
    
    fileobj = pd.compat.StringIO(data)
    df = pd.read_csv(fileobj, sep='\s+').replace('NaN',np.nan)
    
    # use a nested list comprehension to flatten the array and remove nans.
    df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
        .apply(lambda x: ''.join(sorted([i[0] for y in x.values for i in y if pd.notnull(i)])))\
        .reset_index()\
        .rename(columns={0:'CONTACT_INFO'})
    
    print(df)
    

    Returns:

    UID  COMPANY  CONTACT_INFO
    273      7UP            MT
    736      AIG             M
    906      WSJ            ET
    dtype: object