Tags: dataframe, apache-spark, pyspark, apache-spark-sql, format-string

Pyspark dataframe: creating column based on other column values


I have a pyspark dataframe:

[image: a dataframe with country and state columns; the first row has country "USA" and state "CA"]

Now, I want to add a new column called "countryAndState", where, for example, the value for the first row would be "USA_CA". I have tried several approaches, the last one being the following:

df_2 = df.withColumn("countryAndState", '{}_{}'.format(df.country, df.state))

I have tried with "country" and "state" instead, or with simply country and state, and also with col(), but nothing seems to work. Can anyone help me solve this?


Solution

  • You can't use a Python format string here: format() runs eagerly on the driver and just turns the Column objects into a plain Python string, while withColumn expects a Column expression. Use concat instead:

    import pyspark.sql.functions as F
    
    df_2 = df.withColumn("countryAndState", F.concat(F.col('country'), F.lit('_'), F.col('state')))
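
    For reference, here is a minimal, self-contained sketch of the concat approach; the SparkSession setup and the sample rows are assumptions made up for illustration, not taken from the question:

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Hypothetical sample data mirroring the question's country/state columns
    df = spark.createDataFrame([("USA", "CA"), ("USA", "NY")], ["country", "state"])

    df_2 = df.withColumn("countryAndState", F.concat(F.col("country"), F.lit("_"), F.col("state")))
    df_2.show()
    # +-------+-----+---------------+
    # |country|state|countryAndState|
    # +-------+-----+---------------+
    # |    USA|   CA|         USA_CA|
    # |    USA|   NY|         USA_NY|
    # +-------+-----+---------------+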
    

    or concat_ws, if you need to chain many columns together with a given separator:

    import pyspark.sql.functions as F
    
    df_2 = df.withColumn("countryAndState", F.concat_ws('_', F.col('country'), F.col('state')))
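
    One practical difference between the two: concat returns NULL as soon as any input column is NULL, whereas concat_ws simply skips NULL values. A small sketch illustrating this (the row with a NULL state is hypothetical):

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Hypothetical row with a NULL state; an explicit schema is given so Spark
    # doesn't have to infer a type from the None value
    df_null = spark.createDataFrame([("USA", None)], "country: string, state: string")

    df_null.select(
        F.concat(F.col("country"), F.lit("_"), F.col("state")).alias("with_concat"),
        F.concat_ws("_", F.col("country"), F.col("state")).alias("with_concat_ws"),
    ).show()
    # +-----------+--------------+
    # |with_concat|with_concat_ws|
    # +-----------+--------------+
    # |       null|           USA|
    # +-----------+--------------+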