Search code examples
apache-sparkapache-spark-sqlpysparkfillna

Create new column in Pyspark DataFrame by taking ratio of existing columns


I have two columns in a PySpark DataFrame and I want to take ratio of these two columns after filling null values (not inplace). Currently my DataFrame looks like as follows:

+----+----+---+----+----+----+----+
|Acct| M1D|M1C| M2D| M2C| M3D| M3C|
+----+----+---+----+----+----+----+
|   B|  10|200|null|null|  20|null|
|   C|1000|100|  10|null|null|null|
|   A| 100|200| 200| 200| 300|  10|
+----+----+---+----+----+----+----+

My desired output would look as follows:

+------+------+-----+------+------+------+------+-------+
| Acct |  M1D | M1C |  M2D |  M2C |  M3D |  M3C | Ratio |
+------+------+-----+------+------+------+------+-------+
|    B |   10 | 200 | null | null | 20   | null |     0 |
|    C | 1000 | 100 | 10   | null | null | null |    10 |
|    A |  100 | 200 | 200  | 200  | 300  | 10   |    20 |
+------+------+-----+------+------+------+------+-------+

I want to take ratio of M2D with M3C to create new column Ratio. Before taking ratio I want to fill M2D with 0 and M3C with 1 which would be performed on the fly in order to avoid nulls and to avoid replacing values inplace.

I tried doing this by using following code.

df = df.withColumn('Ratio', col('M2D').fillna(0, subset=['M2D']) / col('M3C').fillna(1, subset=['M3C']))

The above code gave me the following error

TypeError: 'Column' object is not callable

As stated by above error in order to avoid TypeError I tried the following line of code. Instead of column now I am using DataFrame.

df = df.withColumn('Ratio', df.select('M2D').fillna(0, subset=['M2D']) / df.select('M3C').fillna(1, subset=['M3C']))

The above code led to following error.

TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'

How can I achieve my desired output?


Solution

  • You should fill null values before you calculate the ratio, like this:

    df = df.fillna(0, subset=['M2D'])\
           .fillna(1, subset=['M3C'])\
           .withColumn('Ratio', col('M2D') / col('M3C'))
    

    Or even simpler, use coalesce as follows if you just want to avoid nulls in calculation:

    df = df.withColumn('Ratio', coalesce(col('M2D'), lit(0)) / coalesce(col('M3C'), lit(1)))