Search code examples
pythonpandascombinations

Generate combinations by systematically selecting rows from groups (using pandas)


I have a pandas dataframe df which appears as following: (toy version below but the real df contains many more columns and groups)

group  sub  fruit
a      1    apple
a      2    banana
a      3    orange
b      1    pear
b      2    strawberry
b      3    cherry
c      1    kiwi
c      2    tomato
c      3    lemon

All groups have the same number of rows. I am trying to generate a new dataframe that contains all the combinations of group and sub but specifically so that each combo contains the all types of groups and all types of subs.

Desired output:

combo  group  sub  fruit
1      a      1    apple
1      b      2    strawberry
1      c      3    lemon
2      a      1    apple
2      c      2    tomato
2      b      3    cherry
3      b      1    pear
3      a      2    banana
3      c      3    lemon
4      c      1    kiwi
4      a      2    banana
4      b      3    cherry
5      c      1    kiwi
5      b      2    strawberry
5      a      3    orange
...

So the below would be a wrong combination, since it was two values of the same sub:

6      c      2    tomato
6      b      2    strawberry
6      a      3    orange

A previously post of mine randomly selected subs but I realized was too unconstrained: Generate combinations by randomly selecting a row from multiple groups (using pandas)


Solution

  • A solution could be:

    import pandas as pd
    import numpy as np
    from itertools import permutations
    
    df = pd.DataFrame({"group":{"0":"a","1":"a","2":"a","3":"b","4":"b","5":"b","6":"c","7":"c","8":"c"},
                       "sub":{"0":1,"1":2,"2":3,"3":1,"4":2,"5":3,"6":1,"7":2,"8":3},
                       "fruit":{"0":"apple","1":"banana","2":"orange","3":"pear","4":"strawberry",
                                "5":"cherry","6":"kiwi","7":"tomato","8":"lemon"}})
    
    df2 = pd.DataFrame({"combo": [j for i in ([i]*3 for i in range(1,7)) for j in i],
                        "group": [j for i in permutations(["a","b","c"]) for j in i], 
                        "sub":[1,2,3]*6})
    
    pd.merge(df2, df, how="left", on=["group","sub"])