Search code examples
rkernel-density

Looking for ideal kernel NW estimate in R


The problem is simple. I have covariate x and some outcome y and I would like to find Nadarya-Watson estimate of y based on x. However, I would like to find a function which satisfies several conditions:

  1. Besides estimate it returns also weights
  2. It handles not only uniformly distributed points for which the estimate is provided.
  3. It is reasonably fast.

I can simply implement it by myself. My naive estimate function than looks something like this:

mNW <- function(x, X, Y, h, K = dnorm) {

  # Arguments
  # x: evaluation points
  # X: covariates
  # Y: outcome
  # h: bandwidth
  # K: kernel

  Kx <- sapply(X, function(Xi) K((x - Xi) / h))

  # Weights
  W <- Kx / rowSums(Kx) 

  # NW estimate
  m <- W %*% Y

  return(list(m = m, W = W))
}

set.seed(123)
X <- rnorm(1000)
Y <- 1 + X - 2*X^2 + rnorm(1000)
x <- c(-3, -2.1, -0.7, 0, 0.3, 0.8, 1, 1.9, 3.2)

mNW(x, X, Y, h = 0.5)

It works fine but it is slow. So I have tried to find something already implemented. First choice was kernsmooth:

ksmooth(X, Y, kernel = "normal", bandwidth = 0.5, x.points = x)

This one is faster, however it does not return weights. Moreover, it only uses "box" and "normal" kernels.

I have also tried locpoly from KernSmooth package:

locpoly(X, Y, drv = 0, kernel = "normal", bandwidth = 0.5, 
        gridsize = 9, range.x = c(-3, 3.2))

Besides it does not return weights, I was not able to run function for my own specification of x and I had to use equally spaced values in some specified range.

So I am wondering if there is something what I missing in these functions or whether there is another solution in R for NW estimate.


Solution

  • I coded your same example in Rcpp and it is way faster than the R function but it is slower than the ksmooth. Anyway it returns the 2 elements you wanted. I could not let the kernel as an input because it's kind of hard to do in Rcpp as you did in R but you can write a simple if else inside the Rcpp code depending the kernels you want to use ([here][1] is a list of the available distributions in R).

    The following is the C++ code that should be saved in a .cpp file and source into R with the Rcpp::sourceCpp()

    #include <RcppArmadillo.h>
    using namespace Rcpp;
    using namespace arma;
    
    // [[Rcpp::depends(RcppArmadillo)]]
    
    // [[Rcpp::export]]
    std::vector<arma::mat>  mNWCpp(Rcpp::NumericVector x, Rcpp::NumericVector X,Rcpp::NumericMatrix Y,
               double h){
    
      int number_loop = X.size();
      int number_x    = x.size();
    
      Rcpp::NumericMatrix Kx(number_x,number_loop);
    
      for(int i =0; i<number_loop;++i){
        Kx(_,i) = dnorm((x-X[i])/h);
      }
    
      Rcpp::NumericVector row_sums = rowSums(Kx);
      Rcpp::NumericMatrix W = Kx*0;
      for(int i =0; i<number_loop;++i){
        W(_,i) = Kx(_,i)/row_sums;
      }
    
    
      arma::mat weights = Rcpp::as<arma::mat>(W);
      arma::mat Ymat = Rcpp::as<arma::mat>(Y);
    
      arma::mat m = weights * Ymat;
    
      std::vector< arma::mat> res(2);
      res[0] = weights;
      res[1] = m;
      return res;
    }
    

    I use the package microbenchmark to compare the 3 functions and the result is as follows:

    Unit: microseconds
     expr    min      lq     mean median      uq    max neval
        R 1991.9 2040.25 2117.656 2070.9 2123.50 3492.5   100
     Rcpp  490.5  502.10  512.318  510.8  517.35  598.0   100
       KS  196.8  205.40  215.598  211.4  219.15  282.2   100