I have a pyspark dataframe:
Now I want to add a new column called "countryAndState" whose value, for the first row for example, would be "USA_CA". I have tried several approaches; the last one was the following:
df_2 = df.withColumn("countryAndState", '{}_{}'.format(df.country, df.state))
I have tried with "country" and "state" instead, or with simply country and state, and also using col(), but nothing seems to work. Can anyone help me solve this?
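You can't use Python format strings on Spark columns. str.format runs in plain Python and returns an ordinary string, while withColumn expects a Column expression. Use concat instead: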
import pyspark.sql.functions as F
df_2 = df.withColumn("countryAndState", F.concat(F.col('country'), F.lit('_'), F.col('state')))
or concat_ws, if you need to chain many columns together with a given separator:
import pyspark.sql.functions as F
df_2 = df.withColumn("countryAndState", F.concat_ws('_', F.col('country'), F.col('state')))