Tags: machine-learning, scikit-learn, linear-algebra, multi-dimensional-scaling, skbio

Why is `sklearn.manifold.MDS` random when `skbio`'s `pcoa` is not?


I'm trying to figure out how to implement Principal Coordinate Analysis with various distance metrics. I found implementations in both skbio and sklearn. I don't understand why sklearn's implementation gives a different result every time, while skbio's is always the same. Is there a degree of randomness to Multidimensional Scaling, and in particular to Principal Coordinate Analysis? I see that all of the clusters are very similar, but why are they different? Am I implementing this correctly?

Running Principal Coordinate Analysis using scikit-bio (skbio) always gives the same results:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
import skbio
from scipy.spatial import distance

%matplotlib inline
np.random.seed(0)

# Iris dataset
DF_data = pd.DataFrame(load_iris().data, 
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns = load_iris().feature_names)
n,m = DF_data.shape
# print(n,m)
# 150 4

Se_targets = pd.Series(load_iris().target, 
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])], 
                       name = "Species")

# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data), 
                           index = DF_data.index,
                           columns = DF_data.columns)

# Distance matrix (note: computed on the raw data; DF_standard is only used for its index here)
Ar_dist = distance.squareform(distance.pdist(DF_data, metric="braycurtis")) # (n x n) distance matrix
DM_dist = skbio.stats.distance.DistanceMatrix(Ar_dist, ids=DF_standard.index)
PCoA = skbio.stats.ordination.pcoa(DM_dist)
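
A minimal way to plot the ordination (a sketch, assuming a recent scikit-bio where pcoa returns an OrdinationResults object whose samples DataFrame has columns "PC1", "PC2", ...):

# Scatter the first two principal coordinates, colored by species
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(PCoA.samples["PC1"], PCoA.samples["PC2"],
           c=[{0: "b", 1: "g", 2: "r"}[t] for t in Se_targets])
ax.set_xlabel("PC1"); ax.set_ylabel("PC2")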

[scatter plot of the first two PCoA coordinates, colored by species; identical on every run]

Now with sklearn's Multidimensional Scaling:

from sklearn.manifold import MDS

# Fit metric MDS five times, once per random_state, and plot each embedding
fig, ax = plt.subplots(ncols=5, figsize=(12, 3))
for rs in range(5):
    M = MDS(n_components=2, metric=True, random_state=rs, dissimilarity='precomputed')
    A = M.fit(Ar_dist).embedding_
    ax[rs].scatter(A[:, 0], A[:, 1], c=[{0: "b", 1: "g", 2: "r"}[t] for t in Se_targets])

[five MDS embeddings, one per random_state; the clusters look similar but each embedding is rotated differently]


Solution

  • scikit-bio's PCoA (skbio.stats.ordination.pcoa) and scikit-learn's MDS (sklearn.manifold.MDS) use entirely different algorithms to transform the data. scikit-bio directly solves a symmetric eigenvalue problem, while scikit-learn uses an iterative minimization procedure (SMACOF) [1]; a minimal sketch of the eigendecomposition appears after the references below.

    scikit-bio's PCoA is deterministic, though the transformed coordinates may come back with different (arbitrary) rotations depending on the system it is executed on [2]. scikit-learn's MDS is stochastic unless a fixed random_state is used. random_state appears to initialize the iterative minimization procedure (the scikit-learn docs say that random_state is used to "initialize the centers" [3], though I don't know exactly what that means). Each random_state may produce a slightly different embedding with an arbitrary rotation [4]; the Procrustes check after the references below makes this concrete.

    References: [1], [2], [3], [4]
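
To make the determinism point concrete, here is a minimal NumPy sketch of classical PCoA (Gower double-centering followed by a symmetric eigendecomposition). scikit-bio's implementation is more elaborate, but the core computation looks like this and has no random component:

import numpy as np

def classical_pcoa(D, k=2):
    # Double-center the squared distance matrix: B = -1/2 * J D^2 J,
    # where J = I - (1/n) 11^T is the centering matrix.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Symmetric eigendecomposition; deterministic up to the sign
    # (i.e. reflection) of each eigenvector.
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:k]
    evals, evecs = evals[order], evecs[:, order]
    # Scale each axis by the square root of its (non-negative) eigenvalue.
    return evecs * np.sqrt(np.maximum(evals, 0.0))

coords = classical_pcoa(Ar_dist)  # identical on every call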
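
And to check that different MDS runs agree up to rotation/reflection, you can align one embedding onto another with scipy.spatial.procrustes; when the solutions differ only by an orthogonal transform and scaling, the reported disparity is typically near zero (SMACOF can land in different local minima, so this is not guaranteed):

from scipy.spatial import procrustes
from sklearn.manifold import MDS

# Fit metric MDS twice with different seeds on the same distance matrix
A0 = MDS(n_components=2, dissimilarity='precomputed', random_state=0).fit(Ar_dist).embedding_
A1 = MDS(n_components=2, dissimilarity='precomputed', random_state=1).fit(Ar_dist).embedding_

# Procrustes aligns A1 onto A0 (translation, scaling, rotation/reflection)
# and reports the remaining sum-of-squares disparity.
_, _, disparity = procrustes(A0, A1)
print(disparity)  # small value => same solution up to an arbitrary rotation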