python pandas for-loop duplicates

Duplicate row in Pandas dataframe based on condition, then update a new column based on condition


I have a dataframe that looks like this:

df = pd.DataFrame({'qty': [10,7,2,1],
                   'status 1': [5,2,2,0],
                   'status 2': [3,2,0,1],
                   'status 3': [2,3,0,0]
                   })

Each row has a qty of items, and each item has exactly one status (1, 2 or 3).

So qty equals the sum of the values of status 1, status 2 and status 3.
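As a quick sanity check of that invariant, here is a minimal sketch using the sample data above. The names `status_cols` and `qty_matches` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'qty': [10, 7, 2, 1],
                   'status 1': [5, 2, 2, 0],
                   'status 2': [3, 2, 0, 1],
                   'status 3': [2, 3, 0, 0]})

# Illustrative helper names, not part of the original data
status_cols = ['status 1', 'status 2', 'status 3']

# True when every row's qty equals the sum of its status counts
qty_matches = df[status_cols].sum(axis=1).eq(df['qty']).all()
print(qty_matches)
```

If this prints False on real data, duplicating by qty and filling the statuses would give different row counts, so it is probably worth checking first.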

I would like to:

  1. Duplicate each row according to the "qty" column

  2. Then edit the 3 status columns (or fill a new column) so that each duplicated row has just 1 status.

The output should look like this:

[image: expected output]

Edit: the order is not important, but I will need to keep the other columns of my initial df.

My (incomplete) solution so far: I found a way to duplicate the rows using this:

df2 = df.loc[df.index.repeat(df['qty'])].reset_index(drop=True)

But I can't find a way to fill the status.

Do I need to use a for loop approach to fill the status?

Should I do this transform in 1 or 2 steps?

Something like: for each initial row, the first n rows take the first status, where n is the value of status 1, and so on...

The output could maybe look like:

[image: possible output]

Edit 1: Thank you for your answers!

Last question: I'm now trying to integrate this into my actual df. What is the best approach to apply these methods to a df that contains many other columns?

I will answer my own last question:

  1. Split df in 2: dfstatus and dfwithoutstatus, keeping the qty column in both

  2. Apply one of your methods to dfstatus

  3. Apply my method to dfwithoutstatus (a simple duplication)

  4. Merge on index
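The four steps above could be sketched roughly as follows. This is only an illustration: the `name` column is a hypothetical stand-in for the other columns, and step 2 uses a plain NumPy one-hot expansion (`np.tile` plus `np.repeat`) rather than any particular answer's code. It assumes qty equals the sum of the status counts in every row, otherwise the two halves would not line up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'name' stands in for the "many other columns"
df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                   'qty': [10, 7, 2, 1],
                   'status 1': [5, 2, 2, 0],
                   'status 2': [3, 2, 0, 1],
                   'status 3': [2, 3, 0, 0]})
status_cols = ['status 1', 'status 2', 'status 3']

# Step 1: split, keeping qty in both halves
df_status = df[['qty'] + status_cols]
df_without_status = df.drop(columns=status_cols)

# Step 2: one-hot expansion of the statuses.
# Stack one identity matrix per original row, then repeat each identity row
# as many times as the matching status count (row-major order).
counts = df_status[status_cols].to_numpy()
identity_blocks = np.tile(np.eye(len(status_cols), dtype=int), (len(df), 1))
one_hot = np.repeat(identity_blocks, counts.ravel(), axis=0)
expanded_status = pd.DataFrame(one_hot, columns=status_cols)

# Step 3: simple duplication of the remaining columns by qty
expanded_rest = (df_without_status
                 .loc[df_without_status.index.repeat(df_without_status['qty'])]
                 .reset_index(drop=True))

# Step 4: merge on index (both halves now share a 0..n-1 RangeIndex)
result = expanded_rest.join(expanded_status)
print(result)
```

Because both halves are expanded in the same row order and then get a fresh index, a plain index join is enough here; no key column is needed.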

Thank you all for your answers.

Best


Solution

  • Here is a possible solution:

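    import numpy as np
    import pandas as pd
    
    # df is the dataframe from the question: 'qty' first, then the status columns.
    # E is an identity matrix: row k is the one-hot encoding of status k + 1.
    E = pd.DataFrame(np.eye(df.shape[1] - 1, dtype=int))
    # Repeat each 'qty' value qty times.
    result = pd.DataFrame(
        df['qty'].reindex(df.index.repeat(df['qty'])).reset_index(drop=True),
    )
    # For each original row, repeat row k of E as many times as status k + 1 occurs.
    result[df.columns[1:]] = pd.concat(
        [E.reindex(E.index.repeat(df.iloc[i, 1:]))
         for i in range(len(df))],
    ).reset_index(
        drop=True,
    )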
    

    Here is the result:

    >>> result
        qty  status 1  status 2  status 3
    0    10         1         0         0
    1    10         1         0         0
    2    10         1         0         0
    3    10         1         0         0
    4    10         1         0         0
    5    10         0         1         0
    6    10         0         1         0
    7    10         0         1         0
    8    10         0         0         1
    9    10         0         0         1
    10    7         1         0         0
    11    7         1         0         0
    12    7         0         1         0
    13    7         0         1         0
    14    7         0         0         1
    15    7         0         0         1
    16    7         0         0         1
    17    2         1         0         0
    18    2         1         0         0
    19    1         0         1         0
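If a single status column is preferred over the one-hot layout (the "update a new column" option from the question), one possible alternative is to melt the status columns into long form and repeat each row by its count. This is a sketch, not the method above; `source_row`, `n`, `long` and `single` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'qty': [10, 7, 2, 1],
                   'status 1': [5, 2, 2, 0],
                   'status 2': [3, 2, 0, 1],
                   'status 3': [2, 3, 0, 0]})
status_cols = ['status 1', 'status 2', 'status 3']

# Wide to long: one row per (original row, status), with its count in 'n'.
# 'source_row' is an illustrative name for the original index.
long = (df.rename_axis('source_row')
          .reset_index()
          .melt(id_vars=['source_row', 'qty'], value_vars=status_cols,
                var_name='status', value_name='n'))

# Repeat each (row, status) pair n times; pairs with n == 0 disappear.
single = (long.loc[long.index.repeat(long['n'])]
              .drop(columns='n')
              .sort_values(['source_row', 'status'])
              .reset_index(drop=True))
print(single)
```

Any other columns that need to be kept can be listed in `id_vars` alongside `qty`, so they are carried through the melt unchanged.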