apache-spark pyspark apache-spark-sql coalesce py4j

Py4JError: An error occurred while calling o230.and


Can anyone help with this error? I'm programming in PySpark and trying to calculate a certain deviation with the following code:

Result =   data.select(count(((coalesce(data["pred"], lit(0)))!=0 & (coalesce(data["val"],lit(0)) !=0
& (abs(coalesce(data["pred"], lit(0)) - coalesce(data["val"],lit(0)))/(coalesce(data["val"],lit(0)))) > 0.1))))

It raises the following error:

"Py4JError: An error occurred while calling o230.and. Trace:
py4j.Py4JException: Method and([class java.lang.Integer]) does not exist 
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)"

I am very new to PySpark and can't figure out what is wrong with my code. A very similar calculation with very similar code worked fine. Does anyone know what the problem is?

PS: For comparison, this is one of several calculations with similar syntax that worked:

Abs_avg = data.select(avg(abs(coalesce(data["pred"], lit(0)) - coalesce(data["val"],lit(0)))))

Solution

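  • You need to wrap each comparison in parentheses. In Python, & binds more tightly than != and >, so without the parentheses the expression is parsed as ... != (0 & coalesce(...)) != ..., and Spark is asked to evaluate 0 & Column. That is the call that fails: the trace says there is no and method that accepts a java.lang.Integer. Also, you don't need the extra parentheses around the coalesce(...) call on the left-hand side of != 0.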

    Result = data.select(
        count(
            (coalesce(data["pred"], lit(0)) != 0) & 
            (coalesce(data["val"], lit(0)) != 0) & 
            (abs(
                 coalesce(data["pred"], lit(0)) - 
                 coalesce(data["val"], lit(0))
                ) / coalesce(data["val"], lit(0)) > 0.1
            )
        )
    )
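
To see the precedence issue without a Spark session, here is a minimal sketch in plain Python. `FakeColumn` is a hypothetical stand-in that I made up for illustration; it is not part of PySpark and only imitates the way a column overloads `!=` and `&`. Its reflected `__rand__` raises a `TypeError` to mimic the missing `and(Integer)` method from the trace, which is an assumption about how to model the failure, not what PySpark does internally.

```python
class FakeColumn:
    """Hypothetical stand-in for a Spark column (illustration only)."""

    def __init__(self, expr):
        self.expr = expr

    def __ne__(self, other):
        # column != value builds a new boolean "column".
        shown = other.expr if isinstance(other, FakeColumn) else repr(other)
        return FakeColumn(f"({self.expr} != {shown})")

    def __and__(self, other):
        # column & column is fine; anything else is rejected.
        if not isinstance(other, FakeColumn):
            raise TypeError(f"and({type(other).__name__}) does not exist")
        return FakeColumn(f"({self.expr} AND {other.expr})")

    def __rand__(self, other):
        # Reached for `0 & column`: int cannot handle the operand, so
        # Python falls back to the reflected method on the right side.
        raise TypeError(f"and({type(other).__name__}) does not exist")


pred, val = FakeColumn("pred"), FakeColumn("val")

# Parenthesized: each comparison is evaluated first, then combined.
ok = (pred != 0) & (val != 0)
print(ok.expr)

# Unparenthesized: Python evaluates `0 & val` before any comparison.
try:
    pred != 0 & val != 0
except TypeError as exc:
    print("failed:", exc)
```

The second expression fails before either comparison runs, because `0 & val` is grouped first. That matches the shape of the original error, where `and` is called with an integer argument.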