
How to create a pipeline for PCA?


I'm working on some customer data where, as a first step, I want to do PCA, followed by clustering as a second step.

Since the data needs to be encoded (and scaled) before being fed to the PCA, I thought it would be good to fit it all into a pipeline. Unfortunately, that doesn't seem to work.

How can I create this pipeline, and does it even make sense to do it like this?

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Creating pipeline objects
encoder = OneHotEncoder(drop='first')
scaler = StandardScaler(with_mean=False)  # sparse input can't be centered
pca = PCA()

# Create pipeline
pca_pipe = make_pipeline(encoder,
                         scaler,
                         pca)

# Fit data to pipeline
pca_pipe.fit_transform(customer_data_raw)

I get the following error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-c4ce88042a66> in <module>()
     20 
     21 # Fit data to pipeline
---> 22 pca_pipe.fit_transform(customer_data_raw)

2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/decomposition/_pca.py in _fit(self, X)
    385         # This is more informative than the generic one raised by check_array.
    386         if issparse(X):
--> 387             raise TypeError('PCA does not support sparse input. See '
    388                             'TruncatedSVD for a possible alternative.')
    389 

TypeError: PCA does not support sparse input. See TruncatedSVD for a possible alternative.

Solution

  • OneHotEncoder returns a sparse matrix from transform by default, and from there the error message is pretty straightforward: you can switch to TruncatedSVD, which accepts sparse input, instead of PCA. Alternatively, set sparse=False in the encoder (the parameter is named sparse_output in scikit-learn >= 1.2) if you want to stick with PCA; a sketch of the dense variant follows below.

    That said, do you really want to one-hot encode every feature, and then scale those dummy variables? Consider a ColumnTransformer if you'd like to encode some features and scale others; see the second sketch below.
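
Here's a minimal sketch of the dense variant. The toy customer_data_raw DataFrame is hypothetical, standing in for the asker's data; note that once the encoder output is dense, StandardScaler no longer needs with_mean=False:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Hypothetical all-categorical data; substitute your own customer_data_raw
customer_data_raw = pd.DataFrame({
    'plan':   ['basic', 'premium', 'basic', 'free'],
    'region': ['north', 'south', 'south', 'west'],
})

# sparse=False makes the encoder return a dense array, which PCA accepts
# (use sparse_output=False in scikit-learn >= 1.2 instead)
pca_pipe = make_pipeline(
    OneHotEncoder(drop='first', sparse=False),
    StandardScaler(),
    PCA(),
)

components = pca_pipe.fit_transform(customer_data_raw)
print(components.shape)  # (n_samples, n_components)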
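
And here's a sketch of the ColumnTransformer approach, again with hypothetical column names: the categorical column is one-hot encoded while the numeric columns are only scaled, and the combined output feeds into PCA:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Hypothetical mixed-type data
customer_data_raw = pd.DataFrame({
    'plan':  ['basic', 'premium', 'basic', 'free'],
    'age':   [23, 41, 35, 29],
    'spend': [10.5, 99.0, 42.0, 0.0],
})

# Encode only the categorical column, scale only the numeric ones
preprocess = ColumnTransformer([
    ('encode', OneHotEncoder(drop='first', sparse=False), ['plan']),
    ('scale',  StandardScaler(), ['age', 'spend']),
])

pca_pipe = make_pipeline(preprocess, PCA(n_components=2))
components = pca_pipe.fit_transform(customer_data_raw)
print(components.shape)  # (4, 2)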