How to duplicate records in pandas dataframe based on column values


I have created a pandas dataframe as follows:

import pandas as pd

ds = {'col1': ["A", "B"], 'probability': [0.3, 0.6]}
df = pd.DataFrame(data=ds)

The dataframe looks like this:

print(df)
  col1  probability
0    A          0.3
1    B          0.6

I need to create a new dataframe that duplicates each row and assigns to the duplicated record the complementary probability, so that the two probabilities for each record sum to 1.

From the example above:

  • I need to duplicate record 0 such that A gets a probability of 0.3 (so it keeps what's already in there) and the duplicated record gets a probability of 0.7 (0.3 + 0.7 = 1)
  • I need to duplicate record 1 such that B gets a probability of 0.6 (so it keeps what's already in there) and the duplicated record gets a probability of 0.4 (0.6 + 0.4 = 1)

The resulting dataframe should look like this:

  col1  probability
0    A          0.3
1    A          0.7
2    B          0.6
3    B          0.4

Can anyone help me do this in pandas, please?


Solution

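  • You can concatenate the dataframe with a copy of itself whose probability column is replaced by its complement. The complement rows are appended after the originals, so the row order is A, B, A, B rather than A, A, B, B: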

    df = pd.concat([df, df.assign(probability=1 - df["probability"])], ignore_index=True)
    
      col1  probability
    0    A          0.3
    1    B          0.6
    2    A          0.7
    3    B          0.4
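
If the interleaved order from the question (A, A, B, B) matters, one possible variation is to keep the original index during the concatenation and then sort on it with a stable algorithm, so that each complement row lands directly after its original. This is only a sketch: the names complement and result are illustrative, and it assumes the original index is unique and already in the desired order.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B"], "probability": [0.3, 0.6]})

# Copy of df in which each probability is replaced by its complement.
complement = df.assign(probability=1 - df["probability"])

# Concatenate WITHOUT ignore_index so both copies share index labels 0, 1.
# A stable sort on the index keeps each original row ahead of its complement;
# reset_index then renumbers the rows 0..3.
result = (
    pd.concat([df, complement])
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)

print(result)  # rows come out in the order A, A, B, B
```

The stable sort (mergesort) is what guarantees the original row precedes its duplicate; with the default sort algorithm the relative order of rows that share an index label is not guaranteed.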