Search code examples
pythonpandasdataframelambdaconcatenation

how to concatenate contents with linebreaks in dataframe


I have three columns (col1, col2, col3) having string contents with line breaks inserted. I would like to concatenate col1,col2 and col3 with new column (DesiredCol) as shown below

enter image description here

Here is the dataset

import pandas as pd

d = {'col1': ["ABC1"+"\n"+"ABC2"+"\n"+"ABC3", "BBC1"+"\n"+"BBC2"+"\n"+"BBC3"], 'col2': ["A"+"\n"+"B"+"\n"+"C", "A"+"\n"+"B"+"\n"+"C"],'col3': ["YES"+"\n"+"NO"+"\n"+"YES", "NO"+"\n"+"NO"+"\n"+"YES"]}

df = pd.DataFrame(data=d)

I tried using lambda function as below, however its not giving the desired column

cols=['col1','col2','col3']

df['DesiredCol'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

Solution

  • Here is one possible solution (the key being to use the zip built-in function to join the respective elements, and then further apply methods on the dataframe to insert the desired newline characters specifically):

    A) Edit: Fastest Solution

    (Thanks to mozway.) For an understanding of what this code achieves, and how each operation changes the data step by step into the desired column output, see solution "B" below. However, for large dataframes, it is better to avoid the use of applymap and instead combine map and zip in a list comprehension as shown here.

    df['DesiredCol'] = ['\n\n'.join(map(' '.join, zip(*(y.splitlines() for y in x))))
                        for x in zip(*(df[c] for c in cols))]
    

    B) Original Solution

    Providing a breakdown of the dataframe operations. *Note: applymap is a simple way of applying a function to every element (or cell) of a dataframe and is shown here for illustration of the intermittent states of the dataframe, but usually there is a faster approach.

    Full code:

    Computationally more efficient solution:

    However, solution "A" above will be faster and is recommended. Using the transpositions as shown here is not the most efficient approach.

    df = pd.DataFrame(data=d)
    
    df2 = df.T.apply(
        lambda x: list(
            map(" ".join, list(zip(*[y.split("\n") for y in x.values])))
        )
    ).T
    
    cols = df2.columns
    df['col4'] = df2[cols].astype(str).apply('\n'.join, axis=1)
    

    Full code for solution as used in explained steps below:

    df2 = df.applymap(lambda s: s.split('\n'))
    
    df3 = df2.T.apply(lambda x: list(zip(*x.values)))
    
    df3 = df3.T.apply(lambda x: x.str.join(' '))
    
    cols = df3.columns
    df['col4'] = df3[cols].astype(str).apply('\n'.join, axis=1)
    

    Breakdown of df operations

    1) Use applymap* to delimit each cell in cols 1-3

    import pandas as pd
    
    d = {
        "col1": [
            "ABC1" + "\n" + "ABC2" + "\n" + "ABC3",
            "BBC1" + "\n" + "BBC2" + "\n" + "BBC3",
        ],
        "col2": ["A" + "\n" + "B" + "\n" + "C", "A" + "\n" + "B" + "\n" + "C"],
        "col3": [
            "YES" + "\n" + "NO" + "\n" + "YES",
            "NO" + "\n" + "NO" + "\n" + "YES",
        ],
    }
    
    df = pd.DataFrame(data=d)
    
    df2 = df.applymap(lambda s: s.split("\n"))
    df2
    

    showing effect of first apply method delimiting by newline characters

    2) Then, apply zip to respectively join the elements of the lists from the columns

    df3 = df2.T.apply(lambda x: list(zip(*x.values)))
    
    df3 = df3.T.apply(lambda x: x.str.join(" "))
    df3
    

    showing the now zipped elements from the three cols of orig df

    3) And lastly join the zipped new multi-column-concatenated lists into newline character-delimited strings

    
    cols = df3.columns
    df["col4"] = df3[cols].astype(str).apply("\n".join, axis=1)
    
    df
    

    which gives the final solution:

    Dataframe with zipped and joined by newline string concatenation