
How to reduce the dimensions of new input data after applying a dimensionality reduction method like MCA


I have a categorical training set like this:

col1   col2   col3   col4
 9      8      10     9
10      8       9     9
.....................

After I reduced the dimensions by applying MCA (Multiple Correspondence Analysis) to it, I got something like this:

dim1    dim2
0.857  -0.575
0.654   0.938
.............

Now my question is: how do I find the (dim1, dim2) values for new input data like this?

col1  col2   col3  col4
10     9       8     8

The outputs of MCA after fitting it on the training set are eigenvalues, inertia, etc.

My code in Python:

import pandas as pd
import prince
from sklearn.cluster import KMeans
data = pd.read_csv("data/training set.csv")
X = data.loc[:, 'OS.1':'DSA.1']
size = len(X)
X = X.values.tolist()

#...
#data preprocessing
#...

df = pd.DataFrame(X)
mca = prince.MCA(
    n_components=2,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='auto',
    random_state=42
)

mca = mca.fit(df)
X = mca.transform(df)

km = KMeans(n_clusters=3)
km.fit(X)

I want to:

1. Take an input from the user.
2. Preprocess it before performing dimensionality reduction using MCA.
3. Predict its cluster using K-means.


Solution

  • You just need to keep your fitted MCA object, mca, around so you can reuse it on new input data. There is no need to refit it: preprocess the new data the same way as the training data, call mca.transform on it, and pass the result to km.predict.

    import pandas as pd
    import prince
    from sklearn.cluster import KMeans
    data = pd.read_csv("data/training set.csv")
    X = data.loc[:, 'OS.1':'DSA.1']
    size = len(X)
    X = X.values.tolist()
    
    #...
    #data preprocessing
    #...
    
    df = pd.DataFrame(X)
    mca = prince.MCA(
        n_components=2,
        n_iter=3,
        copy=True,
        check_input=True,
        engine='auto',
        random_state=42
    )
    
    mca = mca.fit(df)
    X = mca.transform(df)
    
    km = KMeans(n_clusters=3)
    km.fit(X)
    
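    # x_new holds the new input rows (e.g. collected from the user).
    # 1. Preprocess x_new exactly as you preprocessed X above.
    # 2. Reuse the fitted mca on x_new (do not fit it again).
    df_new = pd.DataFrame(x_new)
    X_new = mca.transform(df_new)
    
    # 3. Predict the cluster of each new row with the fitted KMeans.
    cluster_labels = km.predict(X_new)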
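
One note on the preprocessing step: MCA works on the categories it saw during fit, so a new row is only comparable to the training rows if each of its values already occurred in the same column of the training set. The sketch below is a minimal, library-free way to check this before calling mca.transform. The helper names seen_categories and unseen_values are made up for illustration, and the training rows are just the two shown in the question, not the full set. How prince itself reacts to an unseen category is not covered here and may depend on the version.

```python
def seen_categories(rows, columns):
    """Collect the set of values observed in each column of the training rows."""
    seen = {col: set() for col in columns}
    for row in rows:
        for col, value in zip(columns, row):
            # Compare as strings: the data is categorical, and user input
            # usually arrives as text anyway.
            seen[col].add(str(value))
    return seen


def unseen_values(new_row, columns, seen):
    """Return {column: value} for each value of new_row never seen in training."""
    return {
        col: str(value)
        for col, value in zip(columns, new_row)
        if str(value) not in seen[col]
    }


columns = ["col1", "col2", "col3", "col4"]

# Assumption: only the two training rows shown in the question.
train_rows = [[9, 8, 10, 9], [10, 8, 9, 9]]
seen = seen_categories(train_rows, columns)

# The new input row from the question.
new_row = [10, 9, 8, 8]
problems = unseen_values(new_row, columns, seen)
print(problems)
```

With only those two training rows, three of the four values in the new row were never seen, so you would want to handle them (reject the input, or map them to a known category) before transforming. An empty result means every value is known and the row can go straight to mca.transform.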