I am working with a data table that contains a distance matrix of meteorological stations (in metres). The distance from one station to the other ones is organized in each column of the data table.
Here's what the data looks like:
library(data.table)
df <- structure(list(`1` = c(NA, 183861.946148085, 455184.652723047,
84776.9412266853, 471037.870714541), `10` = c(183861.946148085,
NA, 323843.574006837, 268617.304324823, 456775.683807133), `100` = c(455184.652723047,
323843.574006837, NA, 527585.966416289, 757973.787598604), `1000` = c(84776.9412266853,
268617.304324823, 527585.966416289, NA, 498646.318627207), `1002` = c(471037.870714541,
456775.683807133, 757973.787598604, 498646.318627207, NA)), class = "data.frame", row.names = c(NA,
-5L))
dt <- setDT(df)
1 10 100 1000 1002
1: NA 183861.9 455184.7 84776.94 471037.9
2: 183861.95 NA 323843.6 268617.30 456775.7
3: 455184.65 323843.6 NA 527585.97 757973.8
4: 84776.94 268617.3 527586.0 NA 498646.3
5: 471037.87 456775.7 757973.8 498646.32 NA
where the first column shows the distances from station 1
to itself (zero) and to stations 10
, 100
, 1000
and 1002
. And henceforth, column-wise.
I am interested in finding out the number of stations that lie within 100, 150, 250 and 500 km from each station.
Hence, for each column I would like to determine the count (and proportion) of values lying within breaks <- c(100000, 150000, 250000, 500000)
.
The expected output would be:
ID distance count prop
1 0-100000 1 25
1 100000-150000 0 0
1 150000-250000 1 25
1 250000-500000 2 50
...
My real dataset has dimensions 1500 x 1500, so what is the fastest way to achieve the expected result?
Since your dataset is symmetric, we can only consider either upper/lower triangle.
library(dplyr)
df[upper.tri(df)] <- NA
We can get the data in long format, cut
the values based on breaks
, calculate the count
and proportion for each Id.
df %>%
tidyr::pivot_longer(cols = everything(),
values_drop_na = TRUE) %>%
count(name, value = cut(value, c(-Inf, breaks, Inf), c('0-100000',
'100000-150000', '150000-250000', '250000-500000', '500000+'))) %>%
group_by(name) %>%
mutate(prop = n/sum(n) * 100)
# name value n prop
# <chr> <fct> <int> <dbl>
#1 1 0-100000 1 25
#2 1 150000-250000 1 25
#3 1 250000-500000 2 50
#4 10 250000-500000 3 100
#5 100 500000+ 2 100
#6 1000 250000-500000 1 100