I am trying to use the mca package to do multiple correspondence analysis in Python.
I am a bit confused as to how to use it. With PCA, I would expect to fit some data (i.e. find the principal components for those data) and then later use those principal components to transform unseen data.
Based on the MCA documentation, I cannot work out how to do this last step. I also don't understand what any of the cryptically named properties and methods do (e.g. .E, .L, .K, .k, etc.).
So far, if I have a DataFrame with a column containing strings (assume this is the only column in the DF), I would do something like
import mca
import pandas as pd
ca = mca.MCA(pd.get_dummies(df, drop_first=True))
From what I can gather, ca.fs_r(1) is the transformation of the data in df, and ca.L is supposed to be the eigenvalues (although I get a vector of 1s that is one element shorter than my number of features?).
Now, if I had some more data with the same features, say df_new, and assuming I've already converted it correctly to dummy variables, how do I find the equivalent of ca.fs_r(1) for the new data?
The documentation of the mca package is not very clear in that regard. However, there are a few cues which suggest that ca.fs_r_sup(df_new) should be used to project new (unseen) data onto the factors obtained in the analysis.
Both fs_r_sup(self, DF, N=None) and fs_c_sup(self, DF, N=None) take a DataFrame argument, DF. The former finds the row factor scores for the new data; the latter is to find the column factor scores.
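To make "projecting new data onto the factors" concrete, here is a minimal NumPy sketch of the standard transition formula for supplementary rows in correspondence analysis. It is not the mca package's code: fit_ca and project_rows are hypothetical helper names, the toy indicator matrix is made up for illustration, and it applies no Benzécri or Greenacre correction. Whether fs_r_sup returns exactly these numbers (sign and scaling included) is an assumption I have not verified.

```python
import numpy as np

def fit_ca(Z, n_components=2):
    """Plain correspondence analysis of an indicator matrix Z (rows = observations)."""
    Z = np.asarray(Z, dtype=float)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    U, s, Vt = U[:, :n_components], s[:n_components], Vt[:n_components]
    F = (U * s) / np.sqrt(r)[:, None]    # row factor scores
    G = (Vt.T * s) / np.sqrt(c)[:, None] # column factor scores
    return F, G, s

def project_rows(Z_new, G, s):
    """Project supplementary rows: row profiles times G, rescaled by 1/s."""
    Z_new = np.asarray(Z_new, dtype=float)
    profiles = Z_new / Z_new.sum(axis=1, keepdims=True)
    return profiles @ G / s

# Toy data: two categorical variables, fully one-hot encoded
# (columns: colour=red, colour=blue, size=S, size=L).
Z = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
])
F, G, s = fit_ca(Z, n_components=2)

# An unseen observation with the same dummy columns, in the same order.
new_scores = project_rows([[1, 0, 0, 1]], G, s)
```

A useful sanity check is that projecting the training rows themselves reproduces their fitted scores, which is the property you would expect from the PCA-style fit/transform workflow. Note that the new data must have exactly the same dummy columns, in the same order, as the data used for fitting; with pandas, something like pd.get_dummies(df_new).reindex(columns=train_columns, fill_value=0) is one way to line them up, where train_columns is whatever column index you saved from the training dummies.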