Search code examples
python-3.xpandasscikit-learnpca

Using mca package in Python


I am trying to use the mca package to do multiple correspondence analysis in Python.

I am a bit confused as to how to use it. With PCA I would expect to fit some data (i.e. find principal components for those data) and then later I would be able to use the principal components that I found to transform unseen data.

Based on the MCA documentation, I cannot work out how to do this last step. I also don't understand what any of the weirdly cryptically named properties and methods do (i.e. .E, .L, .K, .k etc).

So far if I have a DataFrame with a column containing strings (assume this is the only column in the DF) I would do something like

import mca
ca = mca.MCA(pd.get_dummies(df, drop_first=True))

from what I can gather

ca.fs_r(1)

is the transformation of the data in df and

ca.L

is supposed to be the eigenvalues (although I get a vector of 1s that is one element fewer that my number of features?).

now if I had some more data with the same features, let's say df_new and assuming I've already converted this correctly to dummy variables, how do I find the equivalent of ca.fs_r(1) for the new data


Solution

  • The documentation of the mca package is not very clear with that regard. However, there are a few cues which suggest that ca.fs_r_sup(df_new) should be used to project new (unseen) data onto the factors obtained in the analysis.

    1. The package author refers to new data as supplementary data which is the terminology used in following paper: Abdi, H., & Valentin, D. (2007). Multiple correspondence analysis. Encyclopedia of measurement and statistics, 651-657.
    2. The package has only two functions which accept new data as parameter DF: fs_r_sup(self, DF, N=None) and fs_c_sup(self, DF, N=None). The latter is to find the column factor scores.
    3. The usage guide demonstrates this based on a new data frame which has not been used throughout the component analysis.