I want to use sklearn for PCA (followed by regression and k-means clustering). I have a dataset with 20k features and 2000k rows. However, for each row only a small subset of the features (typically any 5 or so of the 20k) has been measured.
How should I pad my pandas DataFrame or set up sklearn so that it does not use a feature for the rows where that feature has not been measured? (E.g., if I set the null feature values to 0.0, would this distort the outcome?)
For example:
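from sklearn.decomposition import PCA

# array: 2-D NumPy array; the first n columns are features, column n is the target
X = array[:, 0:n]
Y = array[:, n]
pca = PCA()
fit = pca.fit(X)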
If the dataset is padded with zeros for most feature values, will the PCA still be valid?
I see 3 options, though none of them fully solves your problem:
1) You replace the null values with 0, but that will definitely worsen your results, because PCA will treat every 0 as a real measurement;
2) You replace the unknown values with the mean or median of each feature. This might be better, but with almost all values imputed it will still give you a distorted PCA;
3) Last option: don't use PCA, and look for a dimensionality reduction technique designed for sparse data.
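For option 3, one candidate in sklearn is TruncatedSVD, which accepts scipy sparse matrices and does not center the data. Below is a minimal sketch on synthetic data; the sizes, the random values and the choice of 10 components are my assumptions, not something taken from your dataset. Keep in mind that unstored entries are still treated as zeros, so this avoids building a huge dense zero-padded frame, but it does not by itself answer what a missing measurement should mean.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (scaled-down sizes, assumed for illustration).
n_rows, n_features, measured_per_row = 1000, 200, 5

# For each row, pick which features were "measured" and give them random values.
rows = np.repeat(np.arange(n_rows), measured_per_row)
cols = np.concatenate([
    rng.choice(n_features, size=measured_per_row, replace=False)
    for _ in range(n_rows)
])
vals = rng.normal(loc=1.0, size=rows.size)

# CSR storage keeps only the measured entries; the matrix is never densified.
X = sparse.csr_matrix((vals, (rows, cols)), shape=(n_rows, n_features))

# TruncatedSVD works directly on the sparse matrix and does not center it.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)
print(svd.explained_variance_ratio_.sum())
```

X_reduced is a dense array with one row per sample and 10 columns, so it can be passed on to a regression or KMeans step. Whether the components are meaningful with only ~5 measured features per row is something you would have to check on your own data.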