The problem is simple. I have covariate x
and some outcome y
and I would like to find Nadarya-Watson estimate of y
based on x
. However, I would like to find a function which satisfies several conditions:
I can simply implement it by myself. My naive estimate function than looks something like this:
mNW <- function(x, X, Y, h, K = dnorm) {
# Arguments
# x: evaluation points
# X: covariates
# Y: outcome
# h: bandwidth
# K: kernel
Kx <- sapply(X, function(Xi) K((x - Xi) / h))
# Weights
W <- Kx / rowSums(Kx)
# NW estimate
m <- W %*% Y
return(list(m = m, W = W))
}
set.seed(123)
X <- rnorm(1000)
Y <- 1 + X - 2*X^2 + rnorm(1000)
x <- c(-3, -2.1, -0.7, 0, 0.3, 0.8, 1, 1.9, 3.2)
mNW(x, X, Y, h = 0.5)
It works fine but it is slow. So I have tried to find something already implemented. First choice was kernsmooth
:
ksmooth(X, Y, kernel = "normal", bandwidth = 0.5, x.points = x)
This one is faster, however it does not return weights. Moreover, it only uses "box"
and "normal"
kernels.
I have also tried locpoly
from KernSmooth
package:
locpoly(X, Y, drv = 0, kernel = "normal", bandwidth = 0.5,
gridsize = 9, range.x = c(-3, 3.2))
Besides it does not return weights, I was not able to run function for my own specification of x
and I had to use equally spaced values in some specified range.
So I am wondering if there is something what I missing in these functions or whether there is another solution in R for NW estimate.
I coded your same example in Rcpp
and it is way faster than the R
function but it is slower than the ksmooth
. Anyway it returns the 2 elements you wanted. I could not let the kernel as an input because it's kind of hard to do in Rcpp
as you did in R
but you can write a simple if else
inside the Rcpp code depending the kernels you want to use ([here][1] is a list of the available distributions in R).
The following is the C++ code that should be saved in a .cpp file and source into R
with the Rcpp::sourceCpp()
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
std::vector<arma::mat> mNWCpp(Rcpp::NumericVector x, Rcpp::NumericVector X,Rcpp::NumericMatrix Y,
double h){
int number_loop = X.size();
int number_x = x.size();
Rcpp::NumericMatrix Kx(number_x,number_loop);
for(int i =0; i<number_loop;++i){
Kx(_,i) = dnorm((x-X[i])/h);
}
Rcpp::NumericVector row_sums = rowSums(Kx);
Rcpp::NumericMatrix W = Kx*0;
for(int i =0; i<number_loop;++i){
W(_,i) = Kx(_,i)/row_sums;
}
arma::mat weights = Rcpp::as<arma::mat>(W);
arma::mat Ymat = Rcpp::as<arma::mat>(Y);
arma::mat m = weights * Ymat;
std::vector< arma::mat> res(2);
res[0] = weights;
res[1] = m;
return res;
}
I use the package microbenchmark
to compare the 3 functions and the result is as follows:
Unit: microseconds
expr min lq mean median uq max neval
R 1991.9 2040.25 2117.656 2070.9 2123.50 3492.5 100
Rcpp 490.5 502.10 512.318 510.8 517.35 598.0 100
KS 196.8 205.40 215.598 211.4 219.15 282.2 100