Search code examples
pythonpandascsvdataframestring-concatenation

How to merge data from two columns into one with a + sign using pandas


I have 2 columns in this format in a csv.

Drug    Product(Predicted)
#N/A    acetaldehyde
#N/A    nicotine
#N/A    resin
#N/A    rosa
#N/A    eliglustat
#N/A    valpromide
#N/A    clostridium
#N/A    ambenonium
solution    valpromide
solution    valpromide
solution    valpromide
solution    valpromide
solution    valpromide
solution    valpromide
solution    valpromide
solution    valpromide
solution    valpromide
#N/A    amiloride
#N/A    valsartán
#N/A    manitol
#N/A    hipromelosa
#N/A    fosfato disódico
#N/A    fosfato disódico
#N/A    agalsidase beta
#N/A    enoxaparin sodium
solution    warfarine
#N/A    eliglustat
#N/A    silver
#N/A    silver

The output I am looking for is that, create a new column and whenever there is an entry present in the drug column(ignore all N/A), add that entry to the new column by concatenating the 2 columns with a plus sign and get an output csv

For example:

New column
solution + valpromide
solution + valpromide
solution + warfarine

I am new to pandas so I wanted to know how this can be done using the same


Solution

  • Use Series.str.cat with Series.str.strip:

    df['new'] = df['Drug'].str.cat(df['Product(Predicted)'], na_rep='', sep='+').str.strip('+')
    

    Or replace missing values to empty strings before merge:

    df['new'] = (df['Drug'].fillna('') + '+' + df['Product(Predicted)']).str.strip('+')
    

    print (df)
           Drug Product(Predicted)                  new
    0       NaN       acetaldehyde         acetaldehyde
    1       NaN           nicotine             nicotine
    2       NaN              resin                resin
    3       NaN               rosa                 rosa
    4       NaN         eliglustat           eliglustat
    5       NaN         valpromide           valpromide
    6       NaN        clostridium          clostridium
    7       NaN         ambenonium           ambenonium
    8  solution         valpromide  solution+valpromide
    9  solution         valpromide  solution+valpromide
    

    If want remove rows with NaNs in drug column:

    df1 = df.dropna(subset=['Drug'])
    df1['new'] = df1['Drug'] + '+' + df1['Product(Predicted)']
    print (df)
           Drug Product(Predicted)                  new
    8  solution         valpromide  solution+valpromide
    9  solution         valpromide  solution+valpromide