python pandas dataframe lambda concatenation

how to concatenate contents with linebreaks in dataframe

I have three columns (col1, col2, col3) having string contents with line breaks inserted. I would like to concatenate col1,col2 and col3 with new column (DesiredCol) as shown below

Here is the dataset

import pandas as pd

d = {'col1': ["ABC1"+"\n"+"ABC2"+"\n"+"ABC3", "BBC1"+"\n"+"BBC2"+"\n"+"BBC3"], 'col2': ["A"+"\n"+"B"+"\n"+"C", "A"+"\n"+"B"+"\n"+"C"],'col3': ["YES"+"\n"+"NO"+"\n"+"YES", "NO"+"\n"+"NO"+"\n"+"YES"]}

df = pd.DataFrame(data=d)

I tried using lambda function as below, however its not giving the desired column

cols=['col1','col2','col3']

df['DesiredCol'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

Solution

Here is one possible solution (the key being to use the zip built-in function to join the respective elements, and then further apply methods on the dataframe to insert the desired newline characters specifically):

A) Edit: Fastest Solution

(Thanks to mozway.) For an understanding of what this code achieves, and how each operation changes the data step by step into the desired column output, see solution "B" below. However, for large dataframes, it is better to avoid the use of applymap and instead combine map and zip in a list comprehension as shown here.

df['DesiredCol'] = ['\n\n'.join(map(' '.join, zip(*(y.splitlines() for y in x))))
                    for x in zip(*(df[c] for c in cols))]

B) Original Solution

Providing a breakdown of the dataframe operations. *Note: applymap is a simple way of applying a function to every element (or cell) of a dataframe and is shown here for illustration of the intermittent states of the dataframe, but usually there is a faster approach.

Full code:

Computationally more efficient solution:

However, solution "A" above will be faster and is recommended. Using the transpositions as shown here is not the most efficient approach.

df = pd.DataFrame(data=d)

df2 = df.T.apply(
    lambda x: list(
        map(" ".join, list(zip(*[y.split("\n") for y in x.values])))
    )
).T

cols = df2.columns
df['col4'] = df2[cols].astype(str).apply('\n'.join, axis=1)

Full code for solution as used in explained steps below:

df2 = df.applymap(lambda s: s.split('\n'))

df3 = df2.T.apply(lambda x: list(zip(*x.values)))

df3 = df3.T.apply(lambda x: x.str.join(' '))

cols = df3.columns
df['col4'] = df3[cols].astype(str).apply('\n'.join, axis=1)

Breakdown of df operations

1) Use `applymap`* to delimit each cell in cols 1-3

import pandas as pd

d = {
    "col1": [
        "ABC1" + "\n" + "ABC2" + "\n" + "ABC3",
        "BBC1" + "\n" + "BBC2" + "\n" + "BBC3",
    ],
    "col2": ["A" + "\n" + "B" + "\n" + "C", "A" + "\n" + "B" + "\n" + "C"],
    "col3": [
        "YES" + "\n" + "NO" + "\n" + "YES",
        "NO" + "\n" + "NO" + "\n" + "YES",
    ],
}

df = pd.DataFrame(data=d)

df2 = df.applymap(lambda s: s.split("\n"))
df2

2) Then, apply `zip` to respectively join the elements of the lists from the columns

df3 = df2.T.apply(lambda x: list(zip(*x.values)))

df3 = df3.T.apply(lambda x: x.str.join(" "))
df3

3) And lastly join the zipped new multi-column-concatenated lists into newline character-delimited strings


cols = df3.columns
df["col4"] = df3[cols].astype(str).apply("\n".join, axis=1)

df

which gives the final solution: