
Dividing all columns in PySpark SQL


I am new to Spark and this might be a straightforward problem.

I have a PySpark DataFrame DF1 with a column named "A" and some other columns. The other column names may vary from case to case. I want to generate a DataFrame DF2 by dividing the values of these "other" columns by A. The columns of DF2 would be just the "other" columns.

For example, for

DF1 = sql_sc.createDataFrame([(1, 2, 3), (2, 4, 6), (3, 6, 9), (4, 8, 12), (5, 10, 15)], ["A", "B", "C"])

the result would be a table like:

DF2 = sql_sc.createDataFrame([(2, 3), (2, 3), (2, 3), (2, 3), (2, 3)], ["B", "C"])

How can I do this task in PySpark SQL?

Thanks


Solution

  • You can use DataFrame.columns to iterate over the columns and then use the division operator to do the row-wise math, like this (a more compact single-select alternative follows the output below):

    from pyspark.sql.functions import col

    DF1 = spark.createDataFrame([(1, 2, 3), (2, 4, 6), (3, 6, 9), (4, 8, 12), (5, 10, 15)], ["A", "B", "C"])
    DF1.show()

    # Divide every column except "A" by "A", overwriting each column in place
    df = DF1
    for field in DF1.columns:
        if field != 'A':
            df = df.withColumn(field, col(field) / col("A"))

    # Drop "A" so DF2 holds only the ratio columns
    DF2 = df.drop('A')
    DF2.show()
    

    Which outputs this:

    +---+---+---+
    |  A|  B|  C|
    +---+---+---+
    |  1|  2|  3|
    |  2|  4|  6|
    |  3|  6|  9|
    |  4|  8| 12|
    |  5| 10| 15|
    +---+---+---+
    
    +---+---+
    |  B|  C|
    +---+---+
    |2.0|3.0|
    |2.0|3.0|
    |2.0|3.0|
    |2.0|3.0|
    |2.0|3.0|
    +---+---+
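
  • A more compact variant (a sketch, assuming the same DF1 and spark session as above; other_cols is a helper name introduced here for illustration) builds all the ratio columns in a single select with a list comprehension, instead of chaining one withColumn per column:

    from pyspark.sql.functions import col

    # Every column except "A", each divided by "A", keeping the original names
    other_cols = [c for c in DF1.columns if c != "A"]
    DF2 = DF1.select([(col(c) / col("A")).alias(c) for c in other_cols])
    DF2.show()

    Since the question asks about PySpark SQL specifically, the same idea also works as a generated SQL string against a temp view (again a sketch, with the view name "df1" assumed here):

    # Register the DataFrame and build the SELECT list dynamically
    DF1.createOrReplaceTempView("df1")
    exprs = ", ".join("{0} / A AS {0}".format(c) for c in other_cols)
    DF2 = spark.sql("SELECT {} FROM df1".format(exprs))

    Both produce the same output as the loop above; the single select just keeps the query plan flat when there are many columns.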