I have a the following pandas table
df:
EVNT_ID col1 col2 col3 col4
123454 1 Nan 4 5
628392 Nan 3 Nan 7
293899 2 Nan Nan 6
127820 9 11 12 19
Now I am trying to concat all the columns except the first column and I want my data frame to look in the following way
new_df:
EVNT_ID col1 col2 col3 col4 new_col
123454 1 Nan 4 5 1|4|5
628392 Nan 3 Nan 7 3|7
293899 2 Nan Nan 6 2|6
127820 9 11 12 19 9|11|12|19
I am using the following code
df['new_column'] = df[~df.EVNT_ID].apply(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)
but it is giving me the following error
ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I would really appreciate if any one can give me where I am wrong. I'd really appreciate that.
You can do this with filter
and agg
:
df.filter(like='col').agg(
lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
Or,
df.drop('EVNT_ID', 1).agg(
lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
If performance is important, you can use a list comprehension:
joined = [
'|'.join([str(int(x)) for x in r if pd.notna(x)])
for r in df.iloc[:,1:].values.tolist()
]
joined
# ['1|4|5', '3|7', '2|6', '9|11|12|19']
df.assign(new_col=joined)
EVNT_ID col1 col2 col3 col4 new_col
0 123454 1.0 NaN 4.0 5 1|4|5
1 628392 NaN 3.0 NaN 7 3|7
2 293899 2.0 NaN NaN 6 2|6
3 127820 9.0 11.0 12.0 19 9|11|12|19
If you can forgive the overhead of assignment to a DataFrame, here's timings for the two fastest solutions here.
df = pd.concat([df] * 1000, ignore_index=True)
# In this post.
%%timeit
[
'|'.join([str(int(x)) for x in r if pd.notna(x)])
for r in df.iloc[:,1:].values.tolist()
]
# RafaelC's answer.
%%timeit
[
'|'.join([k for k in a if k])
for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values.tolist())
]
31.9 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23.7 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Although note the answers aren't identical because @RafaelC's code produces floats: ['1.0|2.0|9.0', '3.0|11.0', ...]
. If this is fine, then great. Otherwise you'll need to convert to int which adds more overhead.