Tags: python, pyspark, scikit-learn, databricks, loss-function

Sklearn mean_absolute_error always gives different values on the same data across trials when using data from a Spark DataFrame. How can I solve it?


I want to calculate the mean absolute error (MAE) of two columns of a Spark DataFrame, but I cannot use from pyspark.mllib.evaluation import RegressionMetrics because I am on a high-concurrency cluster. Therefore, I use sklearn and convert the columns to pandas.

Here is my code:

from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

mean_absolute_error(test.select("qty").toPandas(), test.select("pred").toPandas())
mean_absolute_percentage_error(test.select("qty").toPandas(), test.select("pred").toPandas())

Both MAE and MAPE give different values every time I run them. What can be the reason? How can I solve it?

Btw: I am using Databricks and cannot share the data.


Solution

  • When you run .toPandas(), the entire Spark DataFrame is collected into the driver node as a pandas DataFrame.
    This is a very expensive operation that also consumes the driver's memory, so be careful.
    In addition, there is no guarantee about the order of the rows in the resulting pandas DataFrame.

    I think what happens is that you run .toPandas() four times (twice per metric call), and each time you get the DataFrame in a different order,
    which causes the changing results.

    If you have to use pandas, I recommend calling tmp = test.select('qty', 'pred').toPandas() only once and then passing the two columns to the MAE and MAPE functions, as in the sketch below.
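
    A minimal sketch of that approach, assuming test is the Spark DataFrame from the question:

    from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

    # Collect both columns in a single call, so "qty" and "pred"
    # come from the same rows in the same order.
    tmp = test.select("qty", "pred").toPandas()

    mae = mean_absolute_error(tmp["qty"], tmp["pred"])
    mape = mean_absolute_percentage_error(tmp["qty"], tmp["pred"])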
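
    If pandas is not strictly required, the same metrics can also be computed directly with Spark SQL functions, which avoids the collect entirely as well as the RDD-based pyspark.mllib API the question rules out. A sketch, again assuming the test DataFrame from the question:

    from pyspark.sql import functions as F

    # MAE = mean(|qty - pred|), MAPE = mean(|qty - pred| / |qty|)
    metrics = test.select(
        F.avg(F.abs(F.col("qty") - F.col("pred"))).alias("mae"),
        F.avg(F.abs(F.col("qty") - F.col("pred")) / F.abs(F.col("qty"))).alias("mape"),
    ).first()

    print(metrics["mae"], metrics["mape"])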