I am using the PCA() implementation from sklearn on a dataframe that has 200 features. The dataframe was created with this code:
df = data.pivot_table(index='customer', columns='purchase', values='amount', aggfunc=sum)
df = df.reset_index().rename_axis(None, axis=1)
df = df.fillna(value=0)
Then, I ran PCA() on it:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
p = pca.fit(df)
sum(pca.explained_variance_ratio_)
In the end, I obtained the result shown below:
0.99999940944358268
Am I wrong, or is it implausible to get such a result when the number of components is set to 1 out of 200?
You should read more about how Principal Component Analysis works.
Is it implausible to get such a result when the number of components is set to 1 out of 200?
It is possible for data with an immense number of features to have almost all of its variance explained by a single component, so that the explained variance of the remaining components is close to zero. To achieve that, the features must be highly correlated with each other. In your case, I would assume two scenarios: either your 200 features really are highly correlated with each other, or PCA() simply aggregates the information of the 200 features into a single new feature very well.
In short, is my data actually only leaning to the one feature?
What could be causing this?
As stated above, PCA does not work with the original features: it creates new ones that summarize as much information as possible from the data. So your data does not actually lean toward one original feature.
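If you want to see this for yourself, here is a minimal sketch on made-up data (the array shapes are arbitrary) showing that each principal component is a weighted combination of all original features rather than a copy of a single one:

import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 200 samples, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

pca = PCA(n_components=1).fit(X)

# components_ holds one weight per original feature for each new component,
# so the first component mixes all 4 original features.
print(pca.components_)          # shape (1, 4)
print(pca.components_.shape)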
Still, I would suggest performing some data preprocessing, as an explained variance ratio of ~99% with a single component looks terribly suspicious. It could be caused by the issues stated above.
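For reference, here is a minimal sketch on synthetic data (the sizes and noise level are arbitrary) showing that highly correlated features alone are enough to push the explained variance ratio of a single component close to one:

import numpy as np
from sklearn.decomposition import PCA

# 200 features that are all noisy copies of the same underlying signal,
# i.e. highly correlated with each other.
rng = np.random.default_rng(0)
signal = rng.normal(size=(1000, 1))
X = signal + 0.01 * rng.normal(size=(1000, 200))

pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)   # prints a value very close to 1.0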
Does summing the values of the features for each customer prior to running PCA affect this?
Almost any data manipulation affects the decomposition; the exceptions are cases such as adding the same constant to every value, which PCA ignores because it centers each feature. You should apply PCA to your data both before and after the sum aggregation to observe the effect.
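As an illustration, here is a minimal sketch on synthetic data: adding the same constant everywhere leaves the decomposition unchanged, while rescaling a single feature changes it. On your own data you could make the same kind of comparison between pivot tables built with and without the sum aggregation (the exact alternative aggregation is up to you; this is an assumption on my part):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))

base = PCA(n_components=1).fit(X).explained_variance_ratio_

# A uniform shift does not change the covariance, so the ratio is identical.
shifted = PCA(n_components=1).fit(X + 10).explained_variance_ratio_

# Rescaling a single feature changes the covariance structure and the ratio.
X_rescaled = X.copy()
X_rescaled[:, 0] *= 100
rescaled = PCA(n_components=1).fit(X_rescaled).explained_variance_ratio_

print(base, shifted, rescaled)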
How should I restructure my data to overcome this apparent error?
First of all, I would suggest another approach to filling in the missing data: instead of 0, you could impute the missing values column by column using the mean or the median. Secondly, you should understand what the features actually mean and whether some of them can be dropped before the decomposition. You could also apply scaling and/or normalization techniques, but these should be tested both before and after fitting the model, as they also affect the model metrics.
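For example, here is a minimal sketch of median imputation followed by standardization before PCA; the dataframe, column names and missing-value pattern are made-up stand-ins for your customer/purchase pivot:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up pivot-like table with missing values.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 6)),
                  columns=[f"purchase_{i}" for i in range(6)])
df[df > 1.5] = np.nan

# Impute column by column with the median instead of filling with 0 ...
df_imputed = df.fillna(df.median())

# ... then standardize each feature so no single column dominates the variance.
X = StandardScaler().fit_transform(df_imputed)

pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)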