Search code examples
rintervalssapply

Check which interval values lies within for each day, and create new matrix containing information of which interval in R


I am working with a dataset containing price information of over 700 cryptocurrencies from the period 2014-01-01 to 2019-12-31 called price.daily

            Bitcoin     Ethereum    XRP      Bitcoin.SV    Stellar    ...
   ...
2018-01-01  13657.20    772.64      2.39     NA            0.480008
2018-01-02  14982.10    884.44      2.48     NA            0.564766
2018-01-03  15201.00    962.72      3.11     NA            0.896227
   ... 

I have then for each day calculated the quantiles using sapply as another question suggested, and this works just fine

col.daily <- seq(1,length(price.daily$Bitcoin))
quantile.daily = sapply(col.daily, function(y) {quantile(x = unlist(price.daily[y,] ), seq(0,1, length=6),na.rm = TRUE )})
quantile.daily.t = t(quantile.daily)
rownames(quantile.daily.t) = rownames(price.daily)

From which I get the number from where my intervals whould be

             0%         20%         40%         60%         80%     100%
   ...
2018-01-01   2.60e-05   0.1681120   0.7189722   2.3060000   9.392   13657.20
2018-01-02   3.40e-05   0.1946376   0.7232178   2.4240000   10.092  14982.10
2018-01-03   3.80e-05   0.1982452   0.7771724   2.4820000   10.054  15201.00
   ...

What I then want to do is for each day take the price of each cryptocurrency, and check which interval it lies within, and create a new matrix containing number 1 to 5 and NA if no data is available. Should come out as

            Bitcoin   Ethereum    XRP     Bitcoin.SV    Stellar   ...
   ...
2018-01-01  5         5           4       NA            2
2018-01-02  5         5           4       NA            2
2018-01-03  5         5           4       NA            3
   ...

I imagine I for this also could use sapply?

A sample of my data using dput(head(price.daily)) for my price.daily data

structure(list(Bitcoin = c(771.4, 802.39, 818.72, 859.51, 933.53, 
953.29), Ethereum = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_), XRP = c(0.026944, 0.028726, 0.027627, 0.028129, 
0.02523, 0.0257), Bitcoin.Cash = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), row.names = c("2014-01-01", 
"2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-06"
), class = "data.frame")

and for quantiles

structure(c(0.00044, 0.000353, 0.000303, 0.000301, 0.000271, 
0.00001, 0.0330034, 0.0319948, 0.0327684, 0.0318646, 0.0274614, 
0.0237276, 0.161692, 0.1793948, 0.163744, 0.1610448, 0.1579238, 
0.0728448, 3.014, 3.728, 3.85, 3.87, 3.814, 2.54200000000001, 
6.036, 7.578, 7.14, 7.434, 7.474, 7.188, 771.4, 802.39, 818.72, 
859.51, 933.53, 953.29), .Dim = c(6L, 6L), .Dimnames = list(c("2014-01-01", 
"2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-06"
), c("0%", "20%", "40%", "60%", "80%", "100%")))

Solution

  • The function findInterval does exactly what you need. The only difficulty comes from applying it to the correct data.

    Simple solution with a loop:

    result_loop = price.daily
    for (i in seq_len(nrow(price.daily))) {
      result_loop[i, ] = findInterval(price.daily[i, ], quantile.daily[i, ])
    }
    

    Alternative solution without a loop:

    combined = cbind(price.daily, quantile.daily)
    result_alternative = as.data.frame(t(apply(combined, 1, function(x) findInterval(x[1:ncol(price.daily)], x[(1 + ncol(price.daily)):ncol(combined)]))))
    colnames(result_alternative) = colnames(price.daily)
    

    The second solution (inspired by this answer to a similar question) has some additional problems like the memory overhead for the combined variable. Even if that weren't the case, I would still use the first solution. It can be tempting to use language constructs to avoid loops, but it makes debugging and maintenance much harder in many cases.

    As a side note: The result can be a matrix instead of a data frame, but since price.daily is (unnecessarily) given as a data frame, I opted for using the same class for the result.