Tags: pyspark, databricks

How to sum row-wise data from a single column in PySpark


I have a dataframe as shown below.

df1:

id Name
1  1*1+0*1
2  1*0+0*0
3  0*0+1+1

The desired output should be:

df2:

id result
1  1
2  0
3  2

How can I achieve this using a PySpark dataframe?


Solution

  • You can use Python's built-in eval function inside a UDF to evaluate each expression string. Following is an example.

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession
    
    
    spark = SparkSession.builder.appName("EvaluateExpressions").getOrCreate()
    
    
    data = [
        (1, '1*1+0*1'),
        (2, '1*0+0*0'),
        (3, '0*0+1+1')
    ]
    schema = ["id", "Name"]
    df1 = spark.createDataFrame(data=data, schema=schema)
    
    
    def evaluate_exp(given_exp):
        # eval runs the arithmetic expression held in the string;
        # only use this on trusted input
        return eval(given_exp)
    
    
    # Without an explicit returnType, the UDF returns a string column
    evaluate_exp_udf = F.udf(evaluate_exp)
    
    df1.show(n=100, truncate=False)
    
    df_result = df1.withColumn("result_python", evaluate_exp_udf(F.col("Name")))
    print("Result using PYTHON UDF")
    df_result.show(n=100, truncate=False)
    

    Output:

    +---+-------+
    |id |Name   |
    +---+-------+
    |1  |1*1+0*1|
    |2  |1*0+0*0|
    |3  |0*0+1+1|
    +---+-------+
    
    Result using PYTHON UDF
    +---+-------+-------------+
    |id |Name   |result_python|
    +---+-------+-------------+
    |1  |1*1+0*1|1            |
    |2  |1*0+0*0|0            |
    |3  |0*0+1+1|2            |
    +---+-------+-------------+