scikit-learn PCA with unknown feature values

I want to use sklearn for PCA (followed by regression and k-means clustering). I have a dataset with 20k features and 2000k rows. However, for each row only a small subset of the features (typically any 5 or so of the 20k) has been measured.

How should I pad my pandas DataFrame, or set up sklearn, so that sklearn does not use a feature for the instances where its value has not been measured? (E.g., if I set null feature values to 0.0, would this distort the outcome?)

For example:

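from sklearn.decomposition import PCA

# array: 2-D NumPy array with n feature columns followed by one target column
X = array[:, 0:n]  # feature matrix
Y = array[:, n]    # target column
pca = PCA()
fit = pca.fit(X)   # fit() returns the fitted PCA object itself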

If the dataset is padded with zeros for most feature values, will the PCA still be valid?

Solution

I see three options, though none of them fully solves your problem:

1) Replace the null values with 0. This will definitely worsen your results, because PCA will treat every zero as a real measurement;

2) Replace the unknown values with the mean or median of each feature. This might be better, but it will still give you a distorted PCA;

3) Don't use PCA at all, and look for a dimensionality reduction technique designed for sparse data.
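
As a rough illustration of option 3, here is a minimal sketch using sklearn's TruncatedSVD, which accepts scipy sparse matrices directly and does not center the data. The sizes, the random toy data, and the choice of 10 components are assumptions made purely for illustration; they are not taken from the question. Note the important caveat: TruncatedSVD still treats every unstored entry as a zero, so this avoids building a huge dense matrix but does not by itself distinguish "not measured" from "measured as 0".

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for the real data (assumed sizes, much smaller than 2000k x 20k)
rng = np.random.default_rng(0)
n_rows, n_features, n_measured = 1000, 200, 5

# Build (row, column, value) triplets: each row gets 5 distinct measured features
rows = np.repeat(np.arange(n_rows), n_measured)
cols = np.concatenate(
    [rng.choice(n_features, size=n_measured, replace=False) for _ in range(n_rows)]
)
vals = rng.normal(size=n_rows * n_measured)

# Only the measured entries are stored; everything else is implicit
X_sparse = sparse.csr_matrix((vals, (rows, cols)), shape=(n_rows, n_features))

# TruncatedSVD works on the sparse matrix without densifying or centering it
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

print(X_reduced.shape)  # (1000, 10)
```

The reduced matrix X_reduced could then be passed on to regression or k-means in place of the original features. If the unmeasured values really must be ignored rather than treated as zeros, a factorization that is fitted only on the observed entries (as used in recommender systems) would be the direction to look in; that is not shown here.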