Tags: apache-spark, apache-spark-mllib, apache-spark-ml, svd, non-deterministic

Spark SVD is not reproducible


I am using the computeSVD method from Spark's IndexedRowMatrix class (in Scala). I have noticed it has no setSeed() method, yet I am getting slightly different results across multiple runs on the same input matrix, possibly due to the internal algorithm Spark uses. Spark also implements an approximate, scalable SVD algorithm, but judging from the source code, computeSVD() on IndexedRowMatrix applies the exact version, not the approximate one.

Since I am doing recommendations with the SVD results, and the user and item latent factor matrices differ between runs, I am actually getting different recommendation lists: some runs return roughly the same items in a different order, while in others a few new items enter the list and some drop out. This happens because the predicted ratings are often almost tied after imputing the missing entries of the input ratings matrix passed to computeSVD().

Has anyone else had this problem? Is there a way to make this fully deterministic, or am I missing something?

Thanks


Solution

  • Whenever you work with numeric computations in Apache Spark, you have to keep two things in mind:

    • Floating-point arithmetic is not associative.

      scala> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
      res0: Boolean = false
      
    • Every exchange in Spark is a potential source of non-determinism. To achieve optimal performance, Spark can merge partial results of the upstream tasks in an arbitrary order.

      This could be addressed with some defensive programming, but the run-time overhead is typically too high to be useful in practice.
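Combining both points, here is a small pure-Python sketch (not Spark) of why merge order matters: the same partial results, folded together in different arrival orders, can produce different totals. The values are made up to make the effect obvious.

```python
import itertools

# Hypothetical partial results, as if produced by four upstream tasks.
# The large magnitudes exaggerate rounding so the effect is visible.
partials = [1e16, 1.0, -1e16, 1.0]

# Fold them in every possible arrival order, the way Spark's exchange
# may merge partition results in an arbitrary order.
totals = {sum(perm) for perm in itertools.permutations(partials)}

# More than one distinct total survives rounding, so the "same"
# computation yields different answers depending on merge order.
print(sorted(totals))
```

With well-scaled data the discrepancies are far smaller, but with millions of accumulations they are still large enough to perturb near-tied predictions.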

    Because of that, the final results can fluctuate even if the procedure doesn't depend on a random number generator (as is the case with computeSVD), and even if a generator seed is set.

    In practice there is really not much you can do about it, short of rewriting the internals. If you suspect that the problem is somehow ill-conditioned, you can try building multiple models with some random noise added to the input, check how sensitive the final predictions are, and take this into account when generating recommendations.
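That sensitivity check could be sketched as follows, using NumPy rather than Spark for brevity; the ratings matrix, rank k, noise scale, and list length top_n are all hypothetical placeholders:

```python
import numpy as np

# Sensitivity check: refit a rank-k SVD on noisy copies of the
# (already imputed) ratings matrix and measure how stable the
# top-n recommendation lists are.
rng = np.random.default_rng(0)
ratings = rng.uniform(1.0, 5.0, size=(20, 50))  # users x items, made up
k, top_n, noise_scale, runs = 5, 10, 1e-3, 10

def top_items(matrix):
    # Rank-k reconstruction, then each user's top-n item indices.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    preds = (u[:, :k] * s[:k]) @ vt[:k]
    return np.argsort(-preds, axis=1)[:, :top_n]

baseline = top_items(ratings)
overlaps = []
for _ in range(runs):
    noisy = ratings + rng.normal(0.0, noise_scale, ratings.shape)
    perturbed = top_items(noisy)
    # Average per-user overlap between baseline and perturbed lists.
    per_user = [len(set(a) & set(b)) / top_n
                for a, b in zip(baseline, perturbed)]
    overlaps.append(float(np.mean(per_user)))

print(f"mean top-{top_n} overlap across {runs} noisy refits: "
      f"{np.mean(overlaps):.3f}")
```

If the overlap stays well below 1.0 even at a tiny noise scale, the predicted ratings are effectively tied, and the exact ordering of the recommendation list should not be treated as meaningful.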