r mean missing-data knn imputation

Problems with imputing missing values with kNN in R


I want to impute missing values with the mean of closest neighbors, but when I try kNN, it gives an error message.

The vector is a stock price, so I have NA on the weekends. I want to replace the NA values (Saturday, Sunday) with the average of the neighbouring trading days: (Friday value + Monday value)/2. I thought the kNN function with k = 2 would be appropriate, but I get an error message.

> Oriental_Stock$Stock
 [1] 42.80 43.05 43.00 43.00 42.20    NA    NA 42.50 40.00 40.25 40.55 
 41.50    NA    NA 40.85
> kNN(Oriental_Stock, variable = colnames("Stock"), k = 2)
Error in `[.data.table`(data, indexNA2s[, variable[i]], `:=`(imp_vars[i],  
 : i is invalid type (matrix). Perhaps in future a 2 column matrix could 
  return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). 
  Please report to data.table issue tracker if you'd like this, or add 
  your comments to FR #657.

Please let me know whether it's possible to do this and maybe there are easier options than kNN. I'm not a Data Scientist, just a student, so I don't know much about this. Thank you in advance for any suggestions!


Solution

  • kNN works on a data.frame, where it picks the neighbours based on the distances between your rows. It doesn't work on a single vector.

    A for-loop could be a fair solution for this:

    # Locations of the first NA in each pair of consecutive NAs (the Saturdays).
    # which(is.na(stock)) lists every NA; the recycled c(TRUE, FALSE) mask keeps every other one.
    # This assumes the NAs always come in pairs and never sit at the very start or end.
    idx <- which(is.na(stock))[c(TRUE, FALSE)]
    
    # For each pair, set both NAs to the mean of the value before (Friday) and the value after (Monday)
    for (x in idx) {
      stock[x] <- stock[x+1] <- (stock[x-1] + stock[x+2]) / 2
    }
    

    Result:

    > stock
     [1] 42.800 43.050 43.000 43.000 42.200 42.350 42.350 42.500 40.000
    [10] 40.250 40.550 41.500 41.175 41.175 40.850
    

    I used stock as the data:

    stock <- c(42.80, 43.05, 43.00, 43.00, 42.20, NA, NA, 42.50, 40.00, 40.25, 40.55,
               41.50, NA, NA, 40.85)
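
If the gaps are not always exactly two values long (a holiday next to a weekend, say), the same idea can be written without assuming pairs. This is only a sketch in base R: the function name fill_gaps_with_neighbour_mean is made up for illustration, and it assumes you want every NA in a gap replaced by the mean of the nearest known value on each side. Gaps at the very start or end of the vector have only one neighbour, so they are left as NA.

```r
stock <- c(42.80, 43.05, 43.00, 43.00, 42.20, NA, NA, 42.50, 40.00, 40.25, 40.55,
           41.50, NA, NA, 40.85)

# Hypothetical helper: fill each NA with the mean of the nearest
# non-NA value before it and the nearest non-NA value after it.
fill_gaps_with_neighbour_mean <- function(x) {
  known <- which(!is.na(x))  # positions holding a real value
  gaps  <- which(is.na(x))   # positions to fill

  # For each gap, count how many known positions lie before it;
  # that count is the index (within 'known') of the previous known value.
  pos <- findInterval(gaps, known)

  # Only fill gaps that have a known value on both sides.
  ok <- pos >= 1 & pos < length(known)

  prev_val <- x[known[pos[ok]]]
  next_val <- x[known[pos[ok] + 1]]
  x[gaps[ok]] <- (prev_val + next_val) / 2
  x
}

filled <- fill_gaps_with_neighbour_mean(stock)
print(filled)
```

On the data above this gives 42.35 for the first weekend and 41.175 for the second, the same values as the for-loop, but it does not depend on how long each run of NAs is.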