
Concatenate all pyspark dataframe columns into one string column


I have the following PySpark DataFrame:

identification  p1  p2  p3  p4
1               1   0   0   1
2               0   1   1   0
3               0   0   0   1

I want to concatenate the columns p1 through p4 into one string column that joins their 1s and 0s, like this:

identification  p1  p2  p3  p4  joined_column
1               1   0   0   1   1001
2               0   1   1   0   0110
3               0   0   0   1   0001

All columns from p1 to p4 are integers, so I tried to cast them to string before concatenating all columns (except the 'identification' column) like this:


from pyspark.sql.types import StringType
from pyspark.sql import functions as F

df_concat = df.withColumn('joined_column', F.concat([F.col(c).cast(StringType()) for c in df.columns if c != 'identification']))

I get the following error:

TypeError: Invalid argument, not a string or column: 

Is there a solution or another way to concatenate all the columns of a PySpark DataFrame into one string column?


Solution

  • I will answer the question based on @samkart's comment. We don't need to cast each column to string; the values are converted automatically during concatenation. The TypeError itself comes from passing a list to F.concat, which expects the columns as separate arguments, so the list has to be unpacked with *.

    from pyspark.sql import functions as F

    # Unpack the list with * so each column is passed to concat as its own argument
    df_concat = df.withColumn('joined_column', F.concat(*[F.col(c) for c in df.columns if c != 'identification']))


    This works regardless of the number and names of the columns.
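
    For completeness, here is a minimal end-to-end sketch of the fix; it assumes a local SparkSession and rebuilds the small example DataFrame from the question.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master('local[*]').getOrCreate()

    df = spark.createDataFrame(
        [(1, 1, 0, 0, 1), (2, 0, 1, 1, 0), (3, 0, 0, 0, 1)],
        ['identification', 'p1', 'p2', 'p3', 'p4'],
    )

    # The * unpacks the list comprehension so concat receives each column
    # as a separate argument; the integer values are cast to string implicitly.
    df_concat = df.withColumn(
        'joined_column',
        F.concat(*[F.col(c) for c in df.columns if c != 'identification']),
    )

    df_concat.show()
    # joined_column is '1001', '0110', '0001' for identifications 1, 2, 3

    One caveat: F.concat returns null for the whole row if any input column is null. If that can happen in your data, F.concat_ws('', ...) is an alternative that skips nulls instead of propagating them.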