I am working with a dataset containing price information of over 700 cryptocurrencies from the period 2014-01-01 to 2019-12-31 called price.daily
Bitcoin Ethereum XRP Bitcoin.SV Stellar ...
...
2018-01-01 13657.20 772.64 2.39 NA 0.480008
2018-01-02 14982.10 884.44 2.48 NA 0.564766
2018-01-03 15201.00 962.72 3.11 NA 0.896227
...
I have then for each day calculated the quantiles using sapply as another question suggested, and this works just fine
col.daily <- seq(1,length(price.daily$Bitcoin))
quantile.daily = sapply(col.daily, function(y) {quantile(x = unlist(price.daily[y,] ), seq(0,1, length=6),na.rm = TRUE )})
quantile.daily.t = t(quantile.daily)
rownames(quantile.daily.t) = rownames(price.daily)
From which I get the number from where my intervals whould be
0% 20% 40% 60% 80% 100%
...
2018-01-01 2.60e-05 0.1681120 0.7189722 2.3060000 9.392 13657.20
2018-01-02 3.40e-05 0.1946376 0.7232178 2.4240000 10.092 14982.10
2018-01-03 3.80e-05 0.1982452 0.7771724 2.4820000 10.054 15201.00
...
What I then want to do is for each day take the price of each cryptocurrency, and check which interval it lies within, and create a new matrix containing number 1 to 5 and NA if no data is available. Should come out as
Bitcoin Ethereum XRP Bitcoin.SV Stellar ...
...
2018-01-01 5 5 4 NA 2
2018-01-02 5 5 4 NA 2
2018-01-03 5 5 4 NA 3
...
I imagine I for this also could use sapply?
A sample of my data using dput(head(price.daily))
for my price.daily data
structure(list(Bitcoin = c(771.4, 802.39, 818.72, 859.51, 933.53,
953.29), Ethereum = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), XRP = c(0.026944, 0.028726, 0.027627, 0.028129,
0.02523, 0.0257), Bitcoin.Cash = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), row.names = c("2014-01-01",
"2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-06"
), class = "data.frame")
and for quantiles
structure(c(0.00044, 0.000353, 0.000303, 0.000301, 0.000271,
0.00001, 0.0330034, 0.0319948, 0.0327684, 0.0318646, 0.0274614,
0.0237276, 0.161692, 0.1793948, 0.163744, 0.1610448, 0.1579238,
0.0728448, 3.014, 3.728, 3.85, 3.87, 3.814, 2.54200000000001,
6.036, 7.578, 7.14, 7.434, 7.474, 7.188, 771.4, 802.39, 818.72,
859.51, 933.53, 953.29), .Dim = c(6L, 6L), .Dimnames = list(c("2014-01-01",
"2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-06"
), c("0%", "20%", "40%", "60%", "80%", "100%")))
The function findInterval does exactly what you need. The only difficulty comes from applying it to the correct data.
Simple solution with a loop:
result_loop = price.daily
for (i in seq_len(nrow(price.daily))) {
result_loop[i, ] = findInterval(price.daily[i, ], quantile.daily[i, ])
}
Alternative solution without a loop:
combined = cbind(price.daily, quantile.daily)
result_alternative = as.data.frame(t(apply(combined, 1, function(x) findInterval(x[1:ncol(price.daily)], x[(1 + ncol(price.daily)):ncol(combined)]))))
colnames(result_alternative) = colnames(price.daily)
The second solution (inspired by this answer to a similar question) has some additional problems like the memory overhead for the combined
variable. Even if that weren't the case, I would still use the first solution. It can be tempting to use language constructs to avoid loops, but it makes debugging and maintenance much harder in many cases.
As a side note: The result can be a matrix instead of a data frame, but since price.daily
is (unnecessarily) given as a data frame, I opted for using the same class for the result.