Search code examples
rsparse-matrixquantilemapplyweighted

How can I get weighted quantiles from a sparsely populated data frame?


I'm looking for some help with adding some columns to a data frame that I can't seem to get right. My df is about 40,000 rows x 30 columns and is structured like this:

Key       0  1  2  3  4  5  6  7  8  9 10  11 12 13 14  15 16  17  18  19 20  21 22 23 24 25  NumNotNA
317862 2151 NA NA NA NA NA NA NA NA NA NA  NA NA NA NA  NA NA  NA  NA  NA NA  NA NA NA NA NA         0
317881 1105 51 33 98  8 27 56 28 17 39 11 112 14 87  7 273 56 189 245 119  7 147 25 NA NA NA        22
290507  111 56 28 NA NA NA NA NA NA NA NA  NA NA NA NA  NA NA  NA  NA  NA NA  NA NA NA NA NA         2
290509  196 NA NA NA NA NA NA NA NA NA NA  NA NA NA NA  NA NA  NA  NA  NA NA  NA NA NA NA NA         0
290520   91  0  2  1  4  1  1  0  0  0  1   1  4  0  1   0  5   0   1   0  1   0  0  4  1  0        25
290521   92  2  5 22  3  4  1 19  0  5  0   1 13  0  0   0  1   0   0   1  1   0 46 15  0  1        25

I am looking to add 11 new columns to the data frame, one for each weighted quantile in the sequence seq(0,1,0.1). I am using the following cosine weights to weight the observations from column names 1:25 in the df above.

cosine_weights <- 1 + cos(pi*seq(0,24)/24)

I am able to get the values I'm looking for by using the wtd.quantile() fucntion from both Hmisc and reldist packages on a single row, but I cannot seem to figure out how to use the apply functions correctly in a useful way to apply this to the entire data frame. Note, one of the issues is that several rows (like rows 1 & 4 above) contain all NA values for column names 1:25. For these rows, I would like my quantiles to yield NA values as well, but to yield valid quantiles for the other rows.

Here is an example of the output I'm looking for from each row, although I would like to append this as new columns to my existing data frame. I have tried to use this with mapply, but keep running into issues. I have to coerce the data as numeric to get this to compute, but when I try to apply it to the full df, it says that my (list) object cannot be coerced to type 'double'. Each fix I try seems to fail, and I cannot find any other similar examples through my searches.

> Hmisc::wtd.quantile(as.numeric(dat[6, 3:27]), weights=cosine_weights, probs = seq(0,1,0.1))
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 0.0  0.0  0.0  1.0  1.6  3.0  4.0  5.0  6.6 20.8 46.0

Thank you for your help!


Solution

  • We can write a function to claculate the values that we want

    calculate_quantile <- function(x) {
       if (all(is.na(x))) return(rep(NA, 11))
       as.numeric(Hmisc::wtd.quantile(x, weights=cosine_weights, probs = seq(0,1,0.1)))
    }
    

    and apply it row-wise using apply

    dat[paste0('Q', 1:11)] <- t(apply(dat[3:27], 1, calculate_quantile))