Search code examples
pythonpandasdataframepython-itertools

How to create columns based on combinations of values between 2 sets of columns?


I have a dataframe with comma separated values that i split out out using pd.concat.

original df:

org    country         type
Orange   USA, GBR, AUS   OWF, PMR, KIQ
Red      AUS, RUS, NZL   DOG, MOP, LOF

splitting out columns gives me a df, which we'll call df_wide,

org        country_1    country_2   country_3   type_1   type_2   type_3
Orange        USA          GBR         AUS         OWF      PMR      KIQ
Watermelon    AUS          RUS         NZL         ODG      MOP      LOF

From the above dataframe, i need to get every possible combination of a single country and single type in a long format:

org     country    type
Orange  USA        OWF
Orange  USA        PMR
Orange  USA        KIQ
Orange  GBR        OWF
Orange  GBR        PMR
Orange  GBR        KIQ

..and so forth

and this is where I'm stuck. I mistakenly thought that i could just transform the dataframe using pd.wide_to_long, but I think the answer my revolves around using itertools. I've searched the forums that relate to this issue, but I still haven't quite figured it out. Looking for any suggestions! also the commas separated values in the original df columns could be dozens of values and therefore I don’t know how many columns wide my wide df will be.


Solution

  • Here's one solution using itertools.product. It doesn't require the intermediary dataframe you created.

    from itertools import chain, product
    
    df = pd.DataFrame({'org': ['Orange', 'Red'],
                       'country': ['USA, GBR, AUS', 'AUS, RUS, NZL'],
                       'type': ['OWF, PMR, KIQ', 'DOG, MOP, LOF']})
    
    split1 = df['country'].str.split(', ')
    split2 = df['type'].str.split(', ')
    
    lens = split1.map(len) * split2.map(len)
    
    c_list, t_list = zip(*chain.from_iterable(map(product, split1, split2)))
    
    res = pd.DataFrame({'org': np.repeat(df['org'], lens),
                        'country': c_list,
                        'type': t_list})
    

    Explanation

    The magic happens with this line:

    c_list, t_list = zip(*chain.from_iterable(map(product, split1, split2)))
    

    Working from the inside out:

    • Calculate the Cartesian product for each pair of items across split1 / split2.
    • Chain them together into a non-nested iterable of results.
    • Unpack and zip into countries and types.

    Result

    print(res)
    
          org country type
    0  Orange     USA  OWF
    0  Orange     USA  PMR
    0  Orange     USA  KIQ
    0  Orange     GBR  OWF
    0  Orange     GBR  PMR
    0  Orange     GBR  KIQ
    0  Orange     AUS  OWF
    0  Orange     AUS  PMR
    0  Orange     AUS  KIQ
    1     Red     AUS  DOG
    1     Red     AUS  MOP
    1     Red     AUS  LOF
    1     Red     RUS  DOG
    1     Red     RUS  MOP
    1     Red     RUS  LOF
    1     Red     NZL  DOG
    1     Red     NZL  MOP
    1     Red     NZL  LOF