I am using the PCA() implementation from sklearn on a dataframe that has 200 features. The dataframe was created with this code:
df = data.pivot_table(index='customer', columns='purchase', values='amount', aggfunc=sum)
df = df.reset_index().rename_axis(None, axis=1)
df = df.fillna(value=0)
Then, I ran PCA() on it:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
p = pca.fit(df)
sum(pca.explained_variance_ratio_)
In the end, I obtained the result shown below:
0.99999940944358268
Am I wrong, or is it implausible to get such a result when the number of components is set to 1 out of 200?
You should read more about how Principal Component Analysis works.
Is it implausible to get such a result when the number of components is set to 1 out of 200?
It is possible for data with an immense number of features to have almost all of its variance explained by a single component, so that the explained variance of the remaining components is close to zero. To achieve that, the features must be highly correlated with each other. In your case, I would assume two scenarios: either your 200 features really are highly correlated with each other, or PCA() simply aggregates the information of the 200 features into a single new feature very well.
In short, is my data actually only leaning to the one feature?
What could be causing this?
As stated above, PCA does not work with the original features: it creates new ones that summarize as much information as possible from the data. So your data does not actually lean toward one original feature.
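If you want to see this for yourself, here is a minimal sketch on made-up data (the array shapes are arbitrary) showing that each principal component is a weighted combination of all original features rather than a copy of a single one:

import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 200 samples, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

pca = PCA(n_components=1).fit(X)

# components_ holds one weight per original feature for each new component,
# so the first component mixes all 4 original features.
print(pca.components_)          # shape (1, 4)
print(pca.components_.shape)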
Still, I would suggest performing some data preprocessing, as an explained variance ratio of ~99% with a single component looks terribly suspicious. It could be caused by the issues stated above.
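For reference, here is a minimal sketch on synthetic data (the sizes and noise level are arbitrary) showing that highly correlated features alone are enough to push the explained variance ratio of a single component close to one:

import numpy as np
from sklearn.decomposition import PCA

# 200 features that are all noisy copies of the same underlying signal,
# i.e. highly correlated with each other.
rng = np.random.default_rng(0)
signal = rng.normal(size=(1000, 1))
X = signal + 0.01 * rng.normal(size=(1000, 200))

pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)   # prints a value very close to 1.0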
Does summing the values of the features for each customer prior to running PCA affect this?
Almost any data manipulation affects the decomposition; the exceptions are cases such as adding the same constant to every value, which PCA ignores because it centers each feature. You should apply PCA to your data both before and after the sum aggregation to observe the effect.
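As an illustration, here is a minimal sketch on synthetic data: adding the same constant everywhere leaves the decomposition unchanged, while rescaling a single feature changes it. On your own data you could make the same kind of comparison between pivot tables built with and without the sum aggregation (the exact alternative aggregation is up to you; this is an assumption on my part):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))

base = PCA(n_components=1).fit(X).explained_variance_ratio_

# A uniform shift does not change the covariance, so the ratio is identical.
shifted = PCA(n_components=1).fit(X + 10).explained_variance_ratio_

# Rescaling a single feature changes the covariance structure and the ratio.
X_rescaled = X.copy()
X_rescaled[:, 0] *= 100
rescaled = PCA(n_components=1).fit(X_rescaled).explained_variance_ratio_

print(base, shifted, rescaled)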
How should I restructure my data to overcome this apparent error?
First of all, I would suggest another approach to filling in the missing data: instead of 0, you could impute the missing values column by column using the mean or the median. Secondly, you should understand what the features actually mean and whether some of them can be dropped before the decomposition. You could also apply scaling and/or normalization techniques, but these should be tested both before and after fitting the model, as they also affect the model metrics.
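For example, here is a minimal sketch of median imputation followed by standardization before PCA; the dataframe, column names and missing-value pattern are made-up stand-ins for your customer/purchase pivot:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up pivot-like table with missing values.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 6)),
                  columns=[f"purchase_{i}" for i in range(6)])
df[df > 1.5] = np.nan

# Impute column by column with the median instead of filling with 0 ...
df_imputed = df.fillna(df.median())

# ... then standardize each feature so no single column dominates the variance.
X = StandardScaler().fit_transform(df_imputed)

pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)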