
Concatenate column values in a pandas DataFrame while ignoring NaNs


I have the following pandas DataFrame

df:

 EVNT_ID  col1  col2  col3  col4
  123454     1   NaN     4     5
  628392   NaN     3   NaN     7
  293899     2   NaN   NaN     6
  127820     9    11    12    19

Now I am trying to concatenate all the columns except the first, and I want my DataFrame to look like this:

new_df:

 EVNT_ID  col1  col2  col3  col4     new_col
  123454     1   NaN     4     5       1|4|5
  628392   NaN     3   NaN     7         3|7
  293899     2   NaN   NaN     6         2|6
  127820     9    11    12    19  9|11|12|19

I am using the following code

df['new_column'] = df[~df.EVNT_ID].apply(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)

but it is giving me the following error

ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

I would really appreciate it if anyone could point out where I am going wrong.


Solution

  • You can do this with filter and agg:

    df.filter(like='col').agg(
        lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
    
    0         1|4|5
    1           3|7
    2           2|6
    3    9|11|12|19
    dtype: object
    

    Or,

    df.drop('EVNT_ID', axis=1).agg(
            lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
    
    0         1|4|5
    1           3|7
    2           2|6
    3    9|11|12|19
    dtype: object
    
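    For a quick end-to-end check, here is a minimal, self-contained sketch (assuming the sample data from the question) that rebuilds the frame and produces new_df with the joined column:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'EVNT_ID': [123454, 628392, 293899, 127820],
        'col1': [1, np.nan, 2, 9],
        'col2': [np.nan, 3, np.nan, 11],
        'col3': [4, np.nan, np.nan, 12],
        'col4': [5, 7, 6, 19],
    })

    # filter(like='col') keeps just the col* columns; agg applies the join row-wise:
    # drop the NaNs, cast to int then str, and join with '|'.
    new_df = df.assign(new_col=df.filter(like='col').agg(
        lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1))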

    If performance is important, you can use a list comprehension:

    joined = [
        '|'.join([str(int(x)) for x in r if pd.notna(x)]) 
        for r in df.iloc[:,1:].values.tolist()
    ]
    joined
    # ['1|4|5', '3|7', '2|6', '9|11|12|19']
    
    df.assign(new_col=joined)   
    
       EVNT_ID  col1  col2  col3  col4     new_col
    0   123454   1.0   NaN   4.0     5       1|4|5
    1   628392   NaN   3.0   NaN     7         3|7
    2   293899   2.0   NaN   NaN     6         2|6
    3   127820   9.0  11.0  12.0    19  9|11|12|19
    
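    If you need this in more than one place, the same comprehension wraps neatly into a small helper. Note that join_row_values is just a name made up for this sketch, and the explicit cols list assumes the column names shown above:

    cols = ['col1', 'col2', 'col3', 'col4']

    def join_row_values(frame, sep='|'):
        # Join the non-NaN values of each row as int strings, in column order.
        return [
            sep.join(str(int(v)) for v in row if pd.notna(v))
            for row in frame.values.tolist()
        ]

    new_df = df.assign(new_col=join_row_values(df[cols]))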

    If you can forgive the overhead of assignment to a DataFrame, here are timings for the two fastest solutions in this post.

    df = pd.concat([df] * 1000, ignore_index=True)
    
    # In this post.
    %%timeit
    [
         '|'.join([str(int(x)) for x in r if pd.notna(x)]) 
         for r in df.iloc[:,1:].values.tolist()
    ]
    # RafaelC's answer.
    %%timeit
    [
        '|'.join([k for k in a if k])
        for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values.T.tolist())
    ]    
    
    31.9 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    23.7 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
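    Note that %%timeit is a Jupyter cell magic, so each of the two snippets above is timed in its own cell, and the two results correspond to them in order. Both avoid calling a Python-level function once per row through apply/agg, which is why they are the fastest options here.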

    Note, though, that the outputs aren't identical, because @RafaelC's code produces floats for the columns that contain NaN: ['1.0|4.0|5', '3.0|7', '2.0|6', '9.0|11.0|12.0|19']. If that is fine, then great. Otherwise you'll need to convert to int, which adds more overhead.
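
    If you do need ints from that output, a rough sketch of the extra conversion (float_joined is just a placeholder name for a list like the one above) is to split on the separator, round-trip each piece through float and int, and re-join:

    float_joined = ['1.0|4.0|5', '3.0|7', '2.0|6', '9.0|11.0|12.0|19']

    [
        '|'.join(str(int(float(p))) for p in s.split('|'))
        for s in float_joined
    ]
    # ['1|4|5', '3|7', '2|6', '9|11|12|19']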