Tags: pandas, scikit-learn, pca, sklearn-pandas

scikit-learn PCA with unknown feature values


I want to use sklearn for PCA (followed by regression and k-means clustering). I have a dataset with 20k features and 2000k rows. However, for each row only a small subset of the features (typically any 5 or so of the 20k) has been measured.

How should I pad my pandas DataFrame or set up sklearn so that sklearn does not use a feature for the rows where its value has not been measured? (E.g. if I set the missing feature values to 0.0, would this distort the outcome?)

eg:

from sklearn.decomposition import PCA

X = array[:, 0:n]  # feature columns
Y = array[:, n]    # target column
pca = PCA()
fit = pca.fit(X)

If the dataset is padded with zeros for most feature values, will the PCA still be valid?


Solution

  • I see three options, though none of them fully solves your problem:

    1) Replace the null values with 0. This will definitely worsen your results, because PCA will treat every padded zero as a real measurement;

    2) Replace the unknown values with the mean or median of each feature. This might be better, but it will still give you a distorted PCA, since almost all of the values in each column would be imputed;

    3) Don't use PCA at all, and look for a dimensionality reduction technique designed for sparse data.
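As a rough sketch of option 3, the snippet below builds a toy dataset in which each row has only 5 measured features, stores it as a SciPy sparse matrix, and reduces it with scikit-learn's `TruncatedSVD`, which accepts sparse input. The sizes, the random data and the choice of `TruncatedSVD` are my own assumptions for illustration, not something taken from the question. Note also that a sparse matrix still treats every unstored entry as an implicit zero, so this mainly makes the problem tractable at this scale rather than making the decomposition truly aware of missing values.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy sizes (assumed for illustration; the real data is far larger).
n_rows, n_features, n_measured = 1000, 200, 5
rng = np.random.default_rng(0)

# For each row, pick n_measured distinct feature indices that were "measured".
rows = np.repeat(np.arange(n_rows), n_measured)
cols = np.concatenate(
    [rng.choice(n_features, size=n_measured, replace=False) for _ in range(n_rows)]
)
vals = rng.normal(size=rows.size)  # made-up measurement values

# Only the measured entries are stored; everything else is an implicit zero.
X = sparse.csr_matrix((vals, (rows, cols)), shape=(n_rows, n_features))

# TruncatedSVD works directly on sparse input and does not center the data.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)
```

The reduced matrix `X_reduced` could then be passed to a regression or k-means step, but whether the resulting components are meaningful with so few measured values per row would need to be checked on the real data.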