Search code examples
pythonpandasdataframe

Shift part of row in dataframe to new row


I have a dataframe (pandas) that I want to transform for displaying purposes. Therefore I want to shift some parts of the dataframe to new rows like below :

  col1 col2 col_to_shift col_not_to_shift1 col_not_to_shift2
0   a1   a2           a3                a4                a5
1   b1   b2          NaN                b4               NaN
2   c1   c2           c3                c4                c5

I would like to obtain the following dataframe where a new row is created for the value (if not NaN) in the column to shift (unique), duplicating the data description contained in col1 and col2 and keeping the columns that I don't want to shift even if they contain NaNs:

  col1 col2 col_to_shift col_not_to_shift1 col_not_to_shift2
0   a1   a2           a3               NaN               NaN
1   a1   a2          NaN                a4                a4
2   b1   b2          NaN                b4               NaN
3   c1   c2           c3               NaN               NaN
4   c1   c2          NaN                c4                c4

I tried looking at pd.shift but was not able to make it work.

Here is a piece of code to produce the dataframe :

data = {"col1": ['a1', 'b1', 'c1'], 'col2': ['a2', 'b2', 'c2'],
        'col_to_shift': ['a3', np.NaN, 'c3'],
        'col_not_to_shift1': ['a4', 'b4', 'c4'],
        'col_not_to_shift2': ['a5', np.NaN, 'c5']}
df = pd.DataFrame(data)

Solution

  • One option using concat:

    • split the DataFrame in two parts keeping grouping columns for both
    • drop the empty rows with dropna
    • sort the rows with sort_values and a stable sort algorithm
    group = ['col1', 'col2']
    cols  = df.columns[df.columns.str.contains('not_to_shift')]
    
    out = (pd.concat([df.drop(columns=cols).dropna(axis=0),
                      df[cols.union(group)].dropna(axis=0)
                     ])
             .sort_values(by=group, kind='stable')
          )
    

    Output:

      col1 col2 col_to_shift col_not_to_shift
    0   a1   a2           a3              NaN
    0   a1   a2          NaN               a4
    1   b1   b2          NaN               b4
    2   c1   c2           c3              NaN
    2   c1   c2          NaN               c4
    

    Variant using the index as group (this does not maintain the original index):

    group = ['col1', 'col2']
    not_shift = ['col_not_to_shift']
    
    tmp = df.set_index(group)
    out = (pd.concat([tmp.drop(columns=not_shift), tmp[not_shift]])
             .dropna(how='all').sort_index(kind='stable')
             .reset_index()
          )
    

    Or by defining the list of columns to shift/not-shift:

    shift = ['col_to_shift']
    not_shift = ['col_not_to_shift1', 'col_not_to_shift2']
    
    out = (pd.concat([df.drop(columns=not_shift),
                      df.drop(columns=shift)
                     ])
             .dropna(subset=shift+not_shift, how='all')
             .sort_index(kind='stable')
          )
    

    Output:

      col1 col2 col_to_shift col_not_to_shift1 col_not_to_shift2
    0   a1   a2           a3               NaN               NaN
    0   a1   a2          NaN                a4                a5
    1   b1   b2          NaN                b4               NaN
    2   c1   c2           c3               NaN               NaN
    2   c1   c2          NaN                c4                c5