Search code examples
pythonpandasdataframereshapesplit-apply-combine

Pandas: How to combine several rows with the same column value and create a new Dataframe which covers all possibilities?


There exists a DataFrame like this:

id name age
0x0 Hans 32
0x0 Peter 21
0x1 Jan 42
0x1 Simon 25
0x1 Klaus 51
0x1 Franz 72

I'm aiming to create a DataFrame that covers any possible combination within the same ID.

The only possibility for ID 0x0 is Hans and Peter. Since ID 0x1 exists four times, there are six possible solutions, as shown in the table below.

Since this is an example, one ID could also exist three, five, seven, ... times which results in more or less possibilities.

id name0 age0 name1 age1
0x0 Hans 32 Peter 21
0x1 Jan 42 Simon 25
0x1 Jan 42 Klaus 51
0x1 Jan 42 Franz 72
0x1 Simon 25 Klaus 51
0x1 Simon 25 Franz 72
0x1 Klaus 51 Franz 72

Using combinations, I can already cover the possibility aspect, but I am losing the ages of each name on the way.

import pandas as pd
from  itertools import combinations

data = pd.DataFrame({'id': ["0x0", "0x0", "0x1", "0x1", "0x1", "0x1"], 'name': ["Hans","Peter","Jan","Simon","Klaus","Franz"], 'age': [32, 21, 42, 25, 51, 72]})

df = (data.groupby('id')['name'].apply(lambda x: pd.DataFrame(list(combinations(x,2))))
        .reset_index(level=1, drop=True)
        .reset_index())
print(df)


Solution

  • Core python itertools combinations is the solution. merge() to get the age

    import itertools
    df = pd.read_csv(io.StringIO("""id  name    age
    0x0 Hans    32
    0x0 Peter   21
    0x1 Jan 42
    0x1 Simon   25
    0x1 Klaus   51
    0x1 Franz   72"""), sep="\t")
    
    df1 = (
    df
        .groupby(["id"])["name"]
        .apply(lambda x: pd.DataFrame(itertools.combinations(list(x),2)))
        .reset_index()
        .merge(df, left_on=["id",0], right_on=["id","name"])
        .merge(df, left_on=["id",1], right_on=["id","name"], suffixes=("0","1"))
        .drop(columns=["level_1",0,1])
    )
    

    output

      id  name0  age0  name1  age1
     0x0   Hans    32  Peter    21
     0x1    Jan    42  Simon    25
     0x1    Jan    42  Klaus    51
     0x1  Simon    25  Klaus    51
     0x1    Jan    42  Franz    72
     0x1  Simon    25  Franz    72
     0x1  Klaus    51  Franz    72