Tags: python, apache-spark, pyspark, azure-databricks

Concat all columns in a dataframe


I am coding Python in Databricks, using Spark 2.4.5.

I need a function with two parameters: the first is a DataFrame, and the second is the name of its surrogate-key (SK) column. I then need to hash all the columns of that DataFrame.

I have written the code below, but how can I concatenate all the columns of a DataFrame dynamically, without listing them one by one?

from pyspark.sql.functions import col, concat, lit, md5

def xHashDataframe(df, skColumn):
  return df.select(
      col(skColumn),
      md5(concat(
        col("column1"), lit("~"),
        col("column2"), lit("~"),
        ...
        col("columnN"), lit("~")
      )).alias("RowHash")
    )
  

Solution

  • There is no need for a UDF. concat_ws joins all columns with the given separator, so you can unpack df.columns directly (assuming import pyspark.sql.functions as F):

    df.withColumn("RowHash", F.md5(F.concat_ws("~", *df.columns))).show(truncate=False)
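    A minimal runnable sketch of this approach, wrapped back into the asker's function signature. The column names (id, name, value) and the sample data are made up for the demo; the sketch also excludes the key column itself from the hash, which is an assumption about the intent:

    ```python
    # Sketch: hash all non-key columns of a DataFrame with md5(concat_ws(...)).
    # Demo column names and data are hypothetical, not from the question.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    def xHashDataframe(df, skColumn):
        # Keep the surrogate-key column out of the hash input (assumption).
        data_cols = [c for c in df.columns if c != skColumn]
        return df.select(
            F.col(skColumn),
            # concat_ws joins every column with "~", so no per-column lit("~")
            # is needed; non-string columns are cast to string automatically.
            F.md5(F.concat_ws("~", *data_cols)).alias("RowHash"),
        )

    spark = SparkSession.builder.master("local[1]").appName("rowhash-demo").getOrCreate()
    df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20)], ["id", "name", "value"])
    xHashDataframe(df, "id").show(truncate=False)
    ```

    One semantic difference worth knowing: concat returns NULL if any input column is NULL, while concat_ws simply skips NULL values, so rows that differ only in which column is NULL can hash to the same value.
    
    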