Search code examples
pythonpandasdataframeappendpadding

How can I append a variable number of blank rows to a dataframe by group?


I have a dataframe that looks like the following, with x number of person ids (more than 1000 persons), x number of transactions per person, and x number of variables (more than 1000 variables):

Person_ID transaction_ID variable_1 variable_2 variable_3 variable_X
person1 transaction1 123 0 1 abc
person1 transaction2 456 1 0 def
person1 transaction3 123 0 1 abc
personx transaction1 123 0 1 abc
personx transaction2 456 0 1 def

I want to pad it with rows containing -10 at the beginning of every person id group so that the total number of rows per person id group is 6, like the following:

Person_ID transaction_ID variable_1 variable_2 variable_3 variable_X
person1 -10 -10 -10 -10 -10
person1 -10 -10 -10 -10 -10
person1 -10 -10 -10 -10 -10
person1 transaction1 123 0 1 abc
person1 transaction2 456 1 0 def
person1 transaction3 123 0 1 abc
personx -10 -10 -10 -10 -10
personx -10 -10 -10 -10 -10
personx -10 -10 -10 -10 -10
personx -10 -10 -10 -10 -10
personx transaction1 123 0 1 abc
personx transaction2 456 0 1 def

Here is the code I tried (updated with concat) and the error below it.

df2 = pd.DataFrame([[''] * len(newdf.columns)], columns=newdf.columns)
df2
for row in newdf.groupby('person_id')['transaction_id']:
   x=newdf.groupby('person_id')['person_id'].nunique()
   if x.any() < 6:
       newdf=pd.concat([newdf, df2*(6-x)], ignore_index=True)

RuntimeWarning: '<' not supported between instances of 'int' and 'tuple', sort order is undefined for incomparable objects.
  newdf=pd.concat([newdf, df2*(6-x)], ignore_index=True)

It appended several NaN rows to the bottom of the dataframe, but not inbetween groups as needed. Thank you in advance as I am a beginner.


Solution

  • Code

    use groupby + apply

    def func1(df):
        n = 6 - len(df)
        if n > 0:
            df1 = pd.DataFrame(df['Person_ID'].iloc[0], columns=['Person_ID'], index=range(0, n))
            return pd.concat([df1.reindex(df.columns, axis=1, fill_value=-10), df])
    out = df.groupby('Person_ID', group_keys=False).apply(func1).reset_index(drop=True)
    

    out

    out

    enter image description here

    Example Code

    import pandas as pd
    data1 = {'Person_ID': ['person1', 'person1', 'person1', 'personx', 'personx'], 
             'transaction_ID': ['transaction1', 'transaction2', 'transaction3', 'transaction1', 'transaction2'], 
             'variable_1': [123, 456, 123, 123, 456], 
             'variable_2': [0, 1, 0, 0, 0], 
             'variable_3': [1, 0, 1, 1, 1], 
             'variable_X': ['abc', 'def', 'abc', 'abc', 'def']}
    df = pd.DataFrame(data1)