I have two columns in a PySpark DataFrame and I want to take ratio of these two columns after filling null values (not inplace). Currently my DataFrame looks like as follows:
+----+----+---+----+----+----+----+
|Acct| M1D|M1C| M2D| M2C| M3D| M3C|
+----+----+---+----+----+----+----+
| B| 10|200|null|null| 20|null|
| C|1000|100| 10|null|null|null|
| A| 100|200| 200| 200| 300| 10|
+----+----+---+----+----+----+----+
My desired output would look as follows:
+------+------+-----+------+------+------+------+-------+
| Acct | M1D | M1C | M2D | M2C | M3D | M3C | Ratio |
+------+------+-----+------+------+------+------+-------+
| B | 10 | 200 | null | null | 20 | null | 0 |
| C | 1000 | 100 | 10 | null | null | null | 10 |
| A | 100 | 200 | 200 | 200 | 300 | 10 | 20 |
+------+------+-----+------+------+------+------+-------+
I want to take ratio of M2D
with M3C
to create new column Ratio
. Before taking ratio I want to fill M2D
with 0
and M3C
with 1
which would be performed on the fly in order to avoid nulls and to avoid replacing values inplace.
I tried doing this by using following code.
df = df.withColumn('Ratio', col('M2D').fillna(0, subset=['M2D']) / col('M3C').fillna(1, subset=['M3C']))
The above code gave me the following error
TypeError: 'Column' object is not callable
As stated by above error in order to avoid TypeError I tried the following line of code. Instead of column now I am using DataFrame.
df = df.withColumn('Ratio', df.select('M2D').fillna(0, subset=['M2D']) / df.select('M3C').fillna(1, subset=['M3C']))
The above code led to following error.
TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'
How can I achieve my desired output?
You should fill null values before you calculate the ratio, like this:
df = df.fillna(0, subset=['M2D'])\
.fillna(1, subset=['M3C'])\
.withColumn('Ratio', col('M2D') / col('M3C'))
Or even simpler, use coalesce
as follows if you just want to avoid nulls in calculation:
df = df.withColumn('Ratio', coalesce(col('M2D'), lit(0)) / coalesce(col('M3C'), lit(1)))