I'm trying to use the SVD imputation from the bcv package but all the imputed values are the same (by column).
This is the dataset with missing data http://pastebin.com/YS9qaUPs
#load data
dataMiss = read.csv('dataMiss.csv')
#impute data
SVDimputation = round(impute.svd(dataMiss)$x, 2)
#find index of missing values
bool = apply(X = dataMiss, 2, is.na)
#put in a new data frame only the imputed value
SVDImpNA = mapply(function(x,y) x[y], as.data.frame(SVDimputation), as.data.frame(bool))
View(SVDImpNA)
head(SVDImpNA)
V1 V2 V3
[1,] -0.01 0.01 0.01
[2,] -0.01 0.01 0.01
[3,] -0.01 0.01 0.01
[4,] -0.01 0.01 0.01
[5,] -0.01 0.01 0.01
[6,] -0.01 0.01 0.01
Where am I wrong?
The impute.svd
algorithm works as follows:
Replace all missing values with the corresponding column means.
Compute a rank-k
approximation to the imputed matrix.
Replace the values in the imputed positions with the corresponding values from the rank-k
approximation computed in Step 2.
Repeat Steps 2 and 3 until convergence.
In your example code, you are setting k=min(n,p)
(the default). Then, in Step 2, the rank-k
approximation is exactly equal to imputed matrix. The algorithm converges after 0 iterations. That is, the algorithm sets all imputed entries to be the column means (or something extremely close to this if there is numerical error).
If you want to do something other than impute the missing values with the column means, you need to use a smaller value for k
. The following code demonstrates this with your sample data:
> library("bcv")
> dataMiss = read.csv('dataMiss.csv')
k=3
> SVDimputation = impute.svd(dataMiss, k = 3, maxiter=10000)$x
> table(round(SVDimputation[is.na(dataMiss)], 2))
-0.01 0.01
531 1062
k=2
> SVDimputation = impute.svd(dataMiss, k = 2, maxiter=10000)$x
> table(round(SVDimputation[is.na(dataMiss)], 2))
-11.31 -6.94 -2.59 -2.52 -2.19 -2.02 -1.67 -1.63
25 23 61 2 54 23 5 44
-1.61 -1.2 -0.83 -0.8 -0.78 -0.43 -0.31 -0.15
14 10 13 19 39 1 14 19
-0.14 -0.02 0 0.01 0.02 0.03 0.06 0.17
83 96 94 77 30 96 82 28
0.46 0.53 0.55 0.56 0.83 0.91 1.26 1.53
1 209 83 23 28 111 16 8
1.77 5.63 9.99 14.34
112 12 33 5
Note that for your data, the default maximum number of iterations (100) was too low (I got a warning message). To fix this, I set maxiter=10000
.