Search code examples
rrangeoverlap

Counting unique groups based on overlapping ranges


I am aiming to make a timeline chart that shows the number of unique insect species for different periods of time. The timeline will be binned into 500 year chunks, dated as Before Present (BP) (i.e. 0 - 500, 501 - 1000, etc.). For now, lets say between 0 - 5000.

I have a dataset with samples of insects, and each sample has a species name and date. The date is in the form of a range, with the start date (age_younger) in one column and the end date (age_older) in another. My issue is that I now need to compare each sample's date range against my timeline bins, see which bins the range overlaps, and count that species only once. Not sample, but the species, i.e. the presence of the grouping variable in that timeline bin.

An example of my dataset would look something like this:

species <- 
  c("Aphodius prodromus","Aphodius prodromus","Xestobium rufovillosum","Aphodius ater","Phyllobius calcaratus","Phyllobius calcaratus","Cercyon unipunctatus","Aphodius prodromus","Phyllobius calcaratus","Lochmaea suturalis","Micrelus ericae","Sericus brunneus","Gabrius breviventer","Anotylus tetracarinatus")

age_older <- 
  c(1550,1550,1550,4100,2119,2119,4100,5000,4347,1476,3564,2567,-54,-19)

age_younger <- 
  c(400,400,400,3808,1974,1974,3808,4000,4051,867,3345,2035,-67,-54)

test = data.frame(species, age_older, age_younger)

A potential issue is that in the full dataset negative age values will indicate years after 1950, meaning that the lower a negative value the closer to today it will be. And that some instances of zero-length age ranges will exist (i.e. some dates represent just one year, e.g. 3569 - 3569).

What I want is a data frame with the predetermined date bins for the timeline chart (for this example: between 0 - 5000), each row representing 500 years, and then a count value of the number of unique insect species for that 500 year slice.

So something like this:

age_bin species_count
0 235
500 678
1000 1456
... ...

I have looked into the iRanges package which seems quite good for counting overlaps in general, but I am struggling with finding out how to count based on a grouping variable.

I am able to generate a count of overlaps overall using the code below and the iRanges package:

library(IRanges)
library(plyranges) #plyranges for using dplyr

#species date ranges
ir1 = with(test, IRanges(age_younger, age_older, names = species))

#timeline bin ranges, from 0 - 5000 BP
ir2 = IRanges(start = seq(0,4501,500), end = seq(500,5000,500))

#calculating number of overlaps
overlaps <- ir2 %>% mutate(n_overlaps = count_overlaps(., ir1))   
overlaps

IRanges object with 10 ranges and 1 metadata column:
           start       end     width | n_overlaps
       <integer> <integer> <integer> |  <integer>
   [1]         0       500       501 |          3
   [2]       500      1000       501 |          4
   [3]      1000      1500       501 |          4
   [4]      1500      2000       501 |          5
   [5]      2000      2500       501 |          3
   [6]      2500      3000       501 |          1
   [7]      3000      3500       501 |          1
   [8]      3500      4000       501 |          4
   [9]      4000      4500       501 |          4
  [10]      4500      5000       501 |          1

Whilst this is a step in the right direction, it does not solve my main issue with relating these overlaps to the number of unique species. Any help is greatly appreciated.


Solution

  • Starting with the data that you gave, I generate a binned_data tibble, and loop over the bins to add the information. Also, rather than working with your test data directly, I create a test_distinct data frame, where duplicated row entries are removed. This simplifies the process and prevents multiple unique operations.

    In addition to counting the number of species in each bin, I store the species names as a character vector, with species separated by a semicolon.

    Lastly, in my binning analysis, the age_bin_start and age_bin_end are both inclusive in the filtering. You can adjust this as needed if you prefer to see left inclusive only bins as with IRanges.

    test_distinct <- test |> 
      dplyr::distinct()
    
    binned_data <- tibble::tibble(
      age_bin_start = c(0, seq(501, 4501, by = 500)),
      age_bin_end = seq(500, 5000, by = 500),
      n_distinct_species = 0,
      species_in_bin = NA
    )
    
    for (i in seq_len(nrow(binned_data))) {
      b_start <- binned_data[i, "age_bin_start", drop = TRUE]  # `drop = TRUE` here to pull the value as a vector from the tibble
      b_end <- binned_data[i, "age_bin_end", drop = TRUE]
      
      species_in_bin <- test_distinct |> 
        dplyr::filter(age_younger <= b_end & age_older >= b_start) |> 
        dplyr::pull(species)
      
      binned_data[i, "n_distinct_species"] <- length(species_in_bin)
      
      binned_data[i, "species_in_bin"] <- paste(species_in_bin, collapse = ";")
    }
    
    

    The results:

    > binned_data
    # A tibble: 10 × 4
       age_bin_start age_bin_end n_distinct_species species_in_bin                                                             
               <dbl>       <dbl>              <dbl> <chr>                                                                      
     1             0         500                  2 Aphodius prodromus;Xestobium rufovillosum                                  
     2           501        1000                  3 Aphodius prodromus;Xestobium rufovillosum;Lochmaea suturalis               
     3          1001        1500                  3 Aphodius prodromus;Xestobium rufovillosum;Lochmaea suturalis               
     4          1501        2000                  3 Aphodius prodromus;Xestobium rufovillosum;Phyllobius calcaratus            
     5          2001        2500                  2 Phyllobius calcaratus;Sericus brunneus                                     
     6          2501        3000                  1 Sericus brunneus                                                           
     7          3001        3500                  1 Micrelus ericae                                                            
     8          3501        4000                  4 Aphodius ater;Cercyon unipunctatus;Aphodius prodromus;Micrelus ericae      
     9          4001        4500                  4 Aphodius ater;Cercyon unipunctatus;Aphodius prodromus;Phyllobius calcaratus
    10          4501        5000                  1 Aphodius prodromus                        
    

    I'm using tibble and dplyr here as dependencies, but you don't need to use these packages. This can be solved in base R, just not as cleanly (in my opinion). Also, you can pull out the unique values from the IRanges package, but I believe that package is overkill for this problem.

    I hope this helps!

    Edit: I should also note that you've got two different age ranges for Aphodius prodromus. If this wasn't a mistake, then the problem could change a bit.