Tags: apache-spark, pyspark, apache-spark-sql

How to change multiple column values to a constant with out specifying all column names?


I have a very wide dataframe in Spark. It has 80 columns, and I want to set one column to 0 and the rest to 1.

For the one I want to set to 0, I tried

    from pyspark.sql.functions import lit

    df = df.withColumn("set_zero_column", lit(0))

and it worked.

Now I want to set the remaining columns to 1. How do I do that without specifying all 79 names?

Any help is appreciated.


Solution

  • Use select with a list comprehension:

    from pyspark.sql.functions import lit
    
    set_one_columns = [lit(1).alias(c) for c in df.columns if c != "set_zero_column"]
    df = df.select(lit(0).alias("set_zero_column"), *set_one_columns)
    

    The line above moves `set_zero_column` to the front. If you need to maintain the original column order, you can instead build the whole select list in one comprehension:

    cols = [lit(0).alias(c) if c == "set_zero_column" else lit(1).alias(c) for c in df.columns]
    df = df.select(*cols)
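
    To see the order-preserving variant end to end, here is a minimal sketch with a small made-up three-column DataFrame (the column names `a` and `b` and the local Spark session are just for illustration):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()

    # Toy stand-in for the 80-column dataframe; "set_zero_column" sits in the middle.
    df = spark.createDataFrame([(10, 20, 30)], ["a", "set_zero_column", "b"])

    # Replace every column with a constant, keeping the original column order.
    cols = [lit(0).alias(c) if c == "set_zero_column" else lit(1).alias(c)
            for c in df.columns]
    df = df.select(*cols)

    df.show()
    ```

    Because the comprehension iterates over `df.columns`, the output schema is still `a, set_zero_column, b`, with values 1, 0, 1 in every row; the same code works unchanged for 80 columns.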