machine-learning, linear-algebra, scikits, dimensionality-reduction, skbio

How to get `skbio` PCoA (Principal Coordinate Analysis) results?


I'm looking at the attributes of skbio's PCoA method (listed below). I am new to this API and I want to get the eigenvectors and the original points projected onto the new axes, similar to .fit_transform in sklearn.decomposition.PCA, so I can create some PC_1 vs PC_2-style plots. I figured out how to get eigvals and proportion_explained, but features comes back as None.

Is that because it's in beta?

Pointers to any tutorials that use this would be greatly appreciated. I am a huge fan of scikit-learn and would like to start using more of scikit's products.

 |  Attributes
 |  ----------
 |  short_method_name : str
 |      Abbreviated ordination method name.
 |  long_method_name : str
 |      Ordination method name.
 |  eigvals : pd.Series
 |      The resulting eigenvalues.  The index corresponds to the ordination
 |      axis labels
 |  samples : pd.DataFrame
 |      The position of the samples in the ordination space, row-indexed by the
 |      sample id.
 |  features : pd.DataFrame
 |      The position of the features in the ordination space, row-indexed by
 |      the feature id.
 |  biplot_scores : pd.DataFrame
 |      Correlation coefficients of the samples with respect to the features.
 |  sample_constraints : pd.DataFrame
 |      Site constraints (linear combinations of constraining variables):
 |      coordinates of the sites in the space of the explanatory variables X.
 |      These are the fitted site scores
 |  proportion_explained : pd.Series
 |      Proportion explained by each of the dimensions in the ordination space.
 |      The index corresponds to the ordination axis labels

Here is my code to generate the principal coordinate analysis (PCoA) object.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
import skbio
from scipy.spatial import distance

%matplotlib inline
np.random.seed(0)

# Iris dataset
DF_data = pd.DataFrame(load_iris().data, 
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns = load_iris().feature_names)
n,m = DF_data.shape
# print(n,m)
# 150 4

Se_targets = pd.Series(load_iris().target, 
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])], 
                       name = "Species")

# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data), 
                           index = DF_data.index,
                           columns = DF_data.columns)

# Distance Matrix
Ar_dist = distance.squareform(distance.pdist(DF_standard.T, metric="braycurtis")) # (m x m) distance measure
DM_dist = skbio.stats.distance.DistanceMatrix(Ar_dist, ids=DF_standard.columns)
PCoA = skbio.stats.ordination.pcoa(DM_dist)



Solution

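  • You can access the transformed sample coordinates with OrdinationResults.samples (PCoA.samples in your code). This returns a pandas.DataFrame row-indexed by sample ID, i.e. the IDs in your distance matrix. Because principal coordinate analysis operates on a distance matrix between samples, there are no feature coordinates to report, so OrdinationResults.features is None. That is expected behavior, not a consequence of the method being in beta. Other ordination methods in scikit-bio that take a sample x feature table as input (e.g. CA, CCA, RDA) do provide transformed feature coordinates.

    Note that your distance matrix is computed from DF_standard.T with ids=DF_standard.columns, so it is 4 x 4 and the "samples" in the result are the four iris measurements. To get one point per iris for a PC_1 vs PC_2 plot, compute the distances on DF_standard itself and pass ids=DF_standard.index.

    Side note: the distance.squareform call is unnecessary because skbio.DistanceMatrix accepts either square-form or vector-form (condensed) arrays.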
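    To see why only sample coordinates come out of a PCoA, here is a minimal NumPy sketch of classical PCoA (metric MDS). It is not scikit-bio's implementation, and the function name classical_pcoa and the random toy data are made up for illustration. The input is only a samples x samples distance matrix, so the eigendecomposition can only place the samples; the features never enter the computation. The returned coords, eigvals and proportion play roughly the same roles as OrdinationResults.samples, .eigvals and .proportion_explained, though scikit-bio may handle details such as negative eigenvalues differently.

```python
import numpy as np


def classical_pcoa(dist):
    """Classical PCoA sketch: embed samples from a square distance matrix.

    Hypothetical helper for illustration, not part of scikit-bio.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]

    # Double-center the squared distances (Gower centering).
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering

    # Eigendecomposition of the symmetric matrix, largest eigenvalue first.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep only axes with clearly positive eigenvalues.
    keep = eigvals > 1e-10 * eigvals.max()
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]

    # Sample coordinates: eigenvectors scaled by sqrt(eigenvalue).
    coords = eigvecs * np.sqrt(eigvals)
    proportion = eigvals / eigvals.sum()
    return coords, eigvals, proportion


# Toy data (an assumption for the example): 6 samples with 3 features.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords, eigvals, proportion = classical_pcoa(dist)
print(coords[:, :2])   # first two axes, one row per sample
print(proportion)      # share of variation captured by each axis
```

    Each row of coords is one sample, so plotting column 0 against column 1 gives the PC_1 vs PC_2-style scatter the question asks for. With Euclidean input distances, the pairwise distances between the rows of coords reproduce the original distance matrix.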