I am aiming to make a timeline chart that shows the number of unique insect species for different periods of time. The timeline will be binned into 500 year chunks, dated as Before Present (BP) (i.e. 0 - 500, 501 - 1000, etc.). For now, lets say between 0 - 5000.
I have a dataset with samples of insects, and each sample has a species name and date. The date is in the form of a range, with the start date (age_younger) in one column and the end date (age_older) in another. My issue is that I now need to compare each sample's date range against my timeline bins, see which bins the range overlaps, and count that species only once. Not sample, but the species, i.e. the presence of the grouping variable in that timeline bin.
An example of my dataset would look something like this:
species <-
c("Aphodius prodromus","Aphodius prodromus","Xestobium rufovillosum","Aphodius ater","Phyllobius calcaratus","Phyllobius calcaratus","Cercyon unipunctatus","Aphodius prodromus","Phyllobius calcaratus","Lochmaea suturalis","Micrelus ericae","Sericus brunneus","Gabrius breviventer","Anotylus tetracarinatus")
age_older <-
c(1550,1550,1550,4100,2119,2119,4100,5000,4347,1476,3564,2567,-54,-19)
age_younger <-
c(400,400,400,3808,1974,1974,3808,4000,4051,867,3345,2035,-67,-54)
test = data.frame(species, age_older, age_younger)
A potential issue is that in the full dataset negative age values will indicate years after 1950, meaning that the lower a negative value the closer to today it will be. And that some instances of zero-length age ranges will exist (i.e. some dates represent just one year, e.g. 3569 - 3569).
What I want is a data frame with the predetermined date bins for the timeline chart (for this example: between 0 - 5000), each row representing 500 years, and then a count value of the number of unique insect species for that 500 year slice.
So something like this:
age_bin | species_count |
---|---|
0 | 235 |
500 | 678 |
1000 | 1456 |
... | ... |
I have looked into the iRanges package which seems quite good for counting overlaps in general, but I am struggling with finding out how to count based on a grouping variable.
I am able to generate a count of overlaps overall using the code below and the iRanges package:
library(IRanges)
library(plyranges) #plyranges for using dplyr
#species date ranges
ir1 = with(test, IRanges(age_younger, age_older, names = species))
#timeline bin ranges, from 0 - 5000 BP
ir2 = IRanges(start = seq(0,4501,500), end = seq(500,5000,500))
#calculating number of overlaps
overlaps <- ir2 %>% mutate(n_overlaps = count_overlaps(., ir1))
overlaps
IRanges object with 10 ranges and 1 metadata column:
start end width | n_overlaps
<integer> <integer> <integer> | <integer>
[1] 0 500 501 | 3
[2] 500 1000 501 | 4
[3] 1000 1500 501 | 4
[4] 1500 2000 501 | 5
[5] 2000 2500 501 | 3
[6] 2500 3000 501 | 1
[7] 3000 3500 501 | 1
[8] 3500 4000 501 | 4
[9] 4000 4500 501 | 4
[10] 4500 5000 501 | 1
Whilst this is a step in the right direction, it does not solve my main issue with relating these overlaps to the number of unique species. Any help is greatly appreciated.
Starting with the data that you gave, I generate a binned_data
tibble, and loop over the bins to add the information. Also, rather than working with your test
data directly, I create a test_distinct
data frame, where duplicated row entries are removed. This simplifies the process and prevents multiple unique
operations.
In addition to counting the number of species in each bin, I store the species names as a character vector, with species separated by a semicolon.
Lastly, in my binning analysis, the age_bin_start
and age_bin_end
are both inclusive in the filtering. You can adjust this as needed if you prefer to see left inclusive only bins as with IRanges.
test_distinct <- test |>
dplyr::distinct()
binned_data <- tibble::tibble(
age_bin_start = c(0, seq(501, 4501, by = 500)),
age_bin_end = seq(500, 5000, by = 500),
n_distinct_species = 0,
species_in_bin = NA
)
for (i in seq_len(nrow(binned_data))) {
b_start <- binned_data[i, "age_bin_start", drop = TRUE] # `drop = TRUE` here to pull the value as a vector from the tibble
b_end <- binned_data[i, "age_bin_end", drop = TRUE]
species_in_bin <- test_distinct |>
dplyr::filter(age_younger <= b_end & age_older >= b_start) |>
dplyr::pull(species)
binned_data[i, "n_distinct_species"] <- length(species_in_bin)
binned_data[i, "species_in_bin"] <- paste(species_in_bin, collapse = ";")
}
The results:
> binned_data
# A tibble: 10 × 4
age_bin_start age_bin_end n_distinct_species species_in_bin
<dbl> <dbl> <dbl> <chr>
1 0 500 2 Aphodius prodromus;Xestobium rufovillosum
2 501 1000 3 Aphodius prodromus;Xestobium rufovillosum;Lochmaea suturalis
3 1001 1500 3 Aphodius prodromus;Xestobium rufovillosum;Lochmaea suturalis
4 1501 2000 3 Aphodius prodromus;Xestobium rufovillosum;Phyllobius calcaratus
5 2001 2500 2 Phyllobius calcaratus;Sericus brunneus
6 2501 3000 1 Sericus brunneus
7 3001 3500 1 Micrelus ericae
8 3501 4000 4 Aphodius ater;Cercyon unipunctatus;Aphodius prodromus;Micrelus ericae
9 4001 4500 4 Aphodius ater;Cercyon unipunctatus;Aphodius prodromus;Phyllobius calcaratus
10 4501 5000 1 Aphodius prodromus
I'm using tibble
and dplyr
here as dependencies, but you don't need to use these packages. This can be solved in base R, just not as cleanly (in my opinion). Also, you can pull out the unique values from the IRanges package, but I believe that package is overkill for this problem.
I hope this helps!
Edit: I should also note that you've got two different age ranges for Aphodius prodromus. If this wasn't a mistake, then the problem could change a bit.