
How to join multiple dataframe columns based on row index to specified column?


PROBLEM STATEMENT:

I'm trying to join multiple pandas data frame columns, based on row index, to a single column already in the data frame. Issues seem to happen when the data in a column is read in as np.nan.

EXAMPLE:

Original Data frame

time msg d0 d1 d2
0 msg0 a b c
1 msg1 x x x
2 msg0 a b c
3 msg2 1 2 3

What I want, if I were to filter for msg0 and msg2

time msg d0 d1 d2
0 msg0 abc NaN NaN
1 msg1 x x x
2 msg0 abc NaN NaN
3 msg2 123 NaN NaN

MY ATTEMPT:

import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
               'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
               'd0': ['a', 'x', 'a', '1'],
               'd1': ['b', 'x', 'b', '2'],
               'd2': ['c', 'x', np.nan, '3']})

mask = df.index[((df['msg'] == "msg0") |
               (df['msg'] == "msg2"))].tolist()

# Is there a better way to combine all columns after a certain point?
# This works fine here but has issues when importing large data sets:
# 'd0' ends up as NaN too. I think this is because some column
# values are read in as np.nan on import.
df.loc[mask, 'd0'] = df['d0'] + df['d1'] + df['d2']
df.iloc[mask, 3:] = "NaN"
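
The NaN in 'd0' described in the comments above can be reproduced in isolation. The following is a minimal sketch, assuming string columns like the ones in the example; the series names `d0`, `d1`, `d2` mirror the column names, while `added` and `filled` are illustrative names only.

```python
import numpy as np
import pandas as pd

d0 = pd.Series(["a", "a"])
d1 = pd.Series(["b", "b"])
d2 = pd.Series(["c", np.nan])

# Element-wise `+` yields NaN wherever any operand is NaN,
# so the second row loses its "ab" prefix entirely.
added = d0 + d1 + d2

# Replacing missing values with an empty string first keeps the partial result.
filled = d0 + d1 + d2.fillna("")

print(added)
print(filled)
```

This suggests the addition itself is fine and that the missing values need to be handled before (or while) the columns are combined.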


Solution

  • The approach is similar to @mozway's answer, but I will go into more detail to make it easier to follow.

    1- Define your target columns and messages (just to make it easier to deal with)

    # the messages to filter
    msgs = ["msg0", "msg2"]
    # the columns to combine (everything except 'time' and 'msg')
    columns = df.columns.drop(['time', 'msg'])
    # the column to contain the result
    total_col = ["d0"]
    

    2- Build a boolean mask of the rows whose 'msg' value is in msgs

    mask = df['msg'].isin(msgs)
    

    3- Compute the combined string for each masked row

    # a- restrict the dataframe to the masked rows and the target columns
    # b- drop missing values, then ''.join() the remaining values as strings
    # c- use axis=1 so the join runs across the columns of each row
    new_total_col = df.loc[mask, columns].apply(lambda x: ''.join(x.dropna().astype(str)), axis=1)
    

    4- Set all target columns and rows to np.nan and redefine the values of the "total" column

    df.loc[mask, columns] = np.nan
    df.loc[mask, total_col] = new_total_col
    

    Result

      time   msg   d0   d1   d2
    0    0  msg0  abc  NaN  NaN
    1    1  msg1    x    x    x
    2    2  msg0   ab  NaN  NaN
    3    3  msg2  123  NaN  NaN
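
For convenience, steps 1 to 4 can be collected into a single runnable script. This is only a sketch under the same assumptions as the question's sample data; the helper name `collapse_columns` and its parameters `target` and `skip` are hypothetical and not part of pandas. It also uses a plain string for the target column and works on a copy, which is a design choice rather than a requirement.

```python
import numpy as np
import pandas as pd


def collapse_columns(df, msgs, target="d0", skip=("time", "msg")):
    """Hypothetical helper: for rows whose 'msg' is in `msgs`, join all
    non-key columns into `target` and blank out the remaining ones."""
    out = df.copy()  # leave the caller's dataframe untouched
    columns = out.columns.drop(list(skip))
    mask = out["msg"].isin(msgs)

    # Join across columns per row, skipping missing values.
    joined = out.loc[mask, columns].apply(
        lambda row: "".join(row.dropna().astype(str)), axis=1
    )

    out.loc[mask, columns] = np.nan
    out.loc[mask, target] = joined
    return out


df = pd.DataFrame({
    "time": ["0", "1", "2", "3"],
    "msg": ["msg0", "msg1", "msg0", "msg2"],
    "d0": ["a", "x", "a", "1"],
    "d1": ["b", "x", "b", "2"],
    "d2": ["c", "x", np.nan, "3"],
})

result = collapse_columns(df, ["msg0", "msg2"])
print(result)
```

Because the join is computed before the columns are blanked, the order of the two assignments matters: resetting the columns first would discard the values that still need to be combined.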