Search code examples
pythonpandasdataframerow

Collapsing rows in a Pandas dataframe if all rows have only one value in their columns


I have following DF

         col1  |  col2   | col3   | col4   | col5  | col6
    0    -     |   15.0  |  -     |  -     |   -   |  -
    1    -     |   -     |  -     |  -     |   -   |  US
    2    -     |   -     |  -     |  Large |   -   |  -
    3    ABC1  |   -     |  -     |  -     |   -   |  -
    4    -     |   -     |  24RA  |  -     |   -   |  -
    5    -     |   -     |  -     |  -     |   345 |  -

I want to collapse rows into one as follows

    output DF:
         col1  |  col2    | col3   | col4   | col5  | col6
    0    ABC1  |   15.0   |  24RA  |  Large |   345 |  US

I do not want to iterate over columns but want to use pandas to achieve this.


Solution

  • Option 0
    Super Simple

    pd.concat([pd.Series(df[c].dropna().values, name=c) for c in df], axis=1)
    
       col1  col2  col3   col4   col5 col6
    0  ABC1  15.0  24RA  Large  345.0   US
    

    Can we handle more than one value per column?
    Sure we can!

    df.loc[2, 'col3'] = 'Test'
    
       col1  col2  col3   col4   col5 col6
    0  ABC1  15.0  Test  Large  345.0   US
    1   NaN   NaN  24RA    NaN    NaN  NaN
    

    Option 1
    Generalized solution using np.where like a surgeon

    v = df.values
    i, j = np.where(np.isnan(v))
    
    s = pd.Series(v[i, j], df.columns[j])
    
    c = s.groupby(level=0).cumcount()
    s.index = [c, s.index]
    s.unstack(fill_value='-')  # <-- don't fill to get NaN
    
       col1  col2  col3   col4 col5 col6
    0  ABC1  15.0  24RA  Large  345   US
    

    df.loc[2, 'col3'] = 'Test'
    
    v = df.values
    i, j = np.where(np.isnan(v))
    
    s = pd.Series(v[i, j], df.columns[j])
    
    c = s.groupby(level=0).cumcount()
    s.index = [c, s.index]
    s.unstack(fill_value='-')  # <-- don't fill to get NaN
    
       col1  col2  col3   col4 col5 col6
    0  ABC1  15.0  Test  Large  345   US
    1     -     -  24RA      -    -    -
    

    Option 2
    mask to make nulls then stack to get rid of them

    Or we could have

    # This should work even if `'-'` are NaN
    # but you can skip the `.mask(df == '-')`
    s = df.mask(df == '-').stack().reset_index(0, drop=True)
    c = s.groupby(level=0).cumcount()
    s.index = [c, s.index]
    s.unstack(fill_value='-')
    
       col1  col2  col3   col4 col5 col6
    0  ABC1  15.0  Test  Large  345   US
    1     -     -  24RA      -    -    -