I am working on the iris dataset with the pyspark.ml.clustering library, in order to understand the fundamentals of PySpark and to create a clustering template for myself.
My Spark version is 2.1.1 and I have Hadoop 2.7.
I know that KMeans and BisectingKMeans have a computeCost() method, which gives model performance based on the sum of squared distances between the input points and their corresponding cluster centers.
Is there a way to compare KMeans model performance with GaussianMixture and LDA model performances on the iris dataset, in order to choose the best model type (KMeans, GaussianMixture, or LDA)?
Short answer: no
Long answer:
You are trying to compare apples with oranges here: in Gaussian Mixture and LDA models there is no concept of a cluster center at all; hence, it is not strange that a method similar to computeCost() does not exist.
It is easy to see this if you look at the actual output of a Gaussian Mixture model; adapting the example from the documentation:
from pyspark.ml.clustering import GaussianMixture
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([-0.1, -0.05]),),
        (Vectors.dense([-0.01, -0.1]),),
        (Vectors.dense([0.9, 0.8]),),
        (Vectors.dense([0.75, 0.935]),),
        (Vectors.dense([-0.83, -0.68]),),
        (Vectors.dense([-0.91, -0.76]),)]
# "spark" is the SparkSession, available by default in the PySpark shell
df = spark.createDataFrame(data, ["features"])

gm = GaussianMixture(k=3, tol=0.0001, maxIter=10, seed=10)  # here we ask for k=3 gaussians
model = gm.fit(df)
transformed_df = model.transform(df)  # assign data to gaussian components ("clusters")
transformed_df.collect()
# Here's the output:
[Row(features=DenseVector([-0.1, -0.05]), prediction=1, probability=DenseVector([0.0, 1.0, 0.0])),
Row(features=DenseVector([-0.01, -0.1]), prediction=2, probability=DenseVector([0.0, 0.0007, 0.9993])),
Row(features=DenseVector([0.9, 0.8]), prediction=0, probability=DenseVector([1.0, 0.0, 0.0])),
Row(features=DenseVector([0.75, 0.935]), prediction=0, probability=DenseVector([1.0, 0.0, 0.0])),
Row(features=DenseVector([-0.83, -0.68]), prediction=1, probability=DenseVector([0.0, 1.0, 0.0])),
Row(features=DenseVector([-0.91, -0.76]), prediction=2, probability=DenseVector([0.0, 0.0006, 0.9994]))]
The actual output of a Gaussian Mixture "clustering" is the third feature above, i.e. the probability column: it is a 3-dimensional vector (because we asked for k=3), showing the "degree" to which each specific data point belongs to each one of the 3 "clusters". In general, the vector components will be less than 1.0, and that is why Gaussian Mixtures are a classic example of "soft clustering": each data point belongs to more than one cluster, to each one by some degree. Now, some implementations (including the one in Spark here) go a step further and assign a "hard" cluster membership (the prediction column above), simply by taking the index of the maximum component in probability - but that is merely an add-on.
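The relation between the two columns can be sketched in plain Python (a toy illustration, not Spark code), using the probability vectors from the output above:

```python
# Probability vectors taken from the first three rows of the
# transformed_df output above.
probabilities = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0007, 0.9993],
    [1.0, 0.0, 0.0],
]

def hard_assignment(prob):
    # the "hard" cluster is just the index of the maximum
    # component of the soft probability vector
    return max(range(len(prob)), key=lambda i: prob[i])

predictions = [hard_assignment(p) for p in probabilities]
print(predictions)  # [1, 2, 0] - matches the prediction column above
```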
What about the output of the model itself?
model.gaussiansDF.show()
+--------------------+--------------------+
| mean| cov|
+--------------------+--------------------+
|[0.82500000000150...|0.005625000000006...|
|[-0.4649980711427...|0.133224999996279...|
|[-0.4600024262536...|0.202493122264028...|
+--------------------+--------------------+
Again, it is easy to see that there are no cluster centers, only the parameters (mean and covariance) of our k=3 gaussians.
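In fact, these parameters are all you need to compute the soft memberships yourself; here is a toy sketch (not Spark's implementation - it assumes diagonal covariances and uses made-up parameters purely for illustration):

```python
import math

def gaussian_pdf(x, mean, var):
    # product of per-dimension univariate normal densities
    # (i.e. a diagonal-covariance gaussian)
    p = 1.0
    for xi, mi, vi in zip(x, mean, var):
        p *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2 * math.pi * vi)
    return p

def soft_membership(x, means, variances, weights):
    # weighted density of each component at x, normalized to sum to 1
    dens = [w * gaussian_pdf(x, m, v)
            for m, v, w in zip(means, variances, weights)]
    total = sum(dens)
    return [d / total for d in dens]

# made-up parameters for k=2 components
means = [(0.8, 0.8), (-0.8, -0.7)]
variances = [(0.01, 0.01), (0.02, 0.02)]
weights = [0.5, 0.5]

# a point close to the second component gets nearly all its mass there
probs = soft_membership((-0.83, -0.68), means, variances, weights)
```

Notice that the output is a probability vector over components, not a distance to any "center" - which is exactly why a computeCost()-style metric has no natural analogue here.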
Similar arguments hold for the LDA case (not shown here).
It is true that the Spark MLlib Clustering Guide claims that the prediction column includes the "predicted cluster center", but this term is quite unfortunate, to put it mildly (to put it bluntly, it is plain wrong).
Needless to say, the above discussion comes directly from the core concepts & theory behind Gaussian Mixture Models, and it is not specific to the Spark implementation...
Functions like computeCost() are there merely to help you evaluate different realizations of K-Means (due to different initializations and/or random seeds), since the algorithm may converge to a non-optimal local minimum.
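For reference, what computeCost() reports for K-Means is the within-set sum of squared errors (WSSSE); a toy re-implementation in plain Python (not Spark's code, with made-up points and centers):

```python
def wssse(points, centers):
    # sum of squared Euclidean distances from each point
    # to its nearest cluster center
    total = 0.0
    for p in points:
        total += min(
            sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            for c in centers
        )
    return total

points = [(-0.1, -0.05), (0.9, 0.8), (-0.83, -0.68)]
centers = [(0.9, 0.8), (-0.5, -0.4)]
print(wssse(points, centers))  # the lower, the better the K-Means realization
```

The definition itself makes the point: the metric is built around cluster centers, so it simply cannot be transplanted to models that do not have any.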