Search code examples
rdplyrmax

How to indicate a higher hierarchy level using dplyr?


I have a dataframe (df) in which each row represents the start (Start) and the end (End) of a specific habitat (Habitat) within a transect (Transect) and site (Site) in meters. It is important to note that the length of the transects varies within and among sites. As an example:

df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
                 Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
                 Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
                 End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))

df

  Site Transect Habitat Start  End
1     A        1       X   0.0  2.8  # Habitat `X` is between the meters 0 and 2.8
2     A        1       Y   2.8  3.4  # Habitat `Y` is between the meters 2.8 and 3.4
3     A        1       X   3.4  5.0  # Habitat `X` is between the meters 3.4 and 5.0
4     A        1       Z   5.0 10.0  # Habitat `Z` is between the meters 5 and 10.0
5     A        2       Z   0.0  1.5
6     A        2       Y   1.5  5.0
7     A        2       X   5.0  8.0
8     A        2       Z   8.0 12.0
9     A        2       X  12.0 15.0
10    B        1       X   0.0  2.0
11    B        1       Z   2.0  5.0
12    B        1       X   5.0  7.5
13    B        1       Y   7.5 20.0
14    B        2       Z   0.0  4.0
15    B        2       X   4.0  8.0
16    B        2       Y   8.0 12.0
17    B        2       Z  12.0 15.0

In this example, for instance, habitat X is twice in the transect 1 in site A. Also, we can observe that the total length of transects 1 and 2 in site A are 10 and 15 m, respectively. In site B, the total length of the transects 1 and 2 are 20 and 15 meters, respectively.

What I want is to calculate per Site and Transect the percentage that each Habitat represents with respect to all the habitats presented in terms of meters. For example, in transect 1 and site A habitat X represents 4.4 meters of a total length of 10 meters for transect 1. In site A and transect 2, habitat X has 6 meters from a total length of 15 meters for transect B.

To this aim, the first thing I do is to calculate the length (Length) in meters of each habitat record (=row)

df$Length <- df$End - df$Start

Then, what I want is to calculate by site and transect the percentage that the meters of an habitat represents with respect the rest of habitats and the total length of the transect. I tried this:

df2 <- as.data.frame(df %>% group_by(Site, Transect, Habitat) %>% summarise(Porcentage = (sum(Length)/max(End))*100))

I want to change max(End) to another expression that represents the total length OF THE TRANSECT. Right now max(End) represents the last meter (End) in which a specific habitat was present. How can I include in the code above "maximum value of End" but within of a specific Site and Transect, but not for a specific Habitat.

How can I do it? My desired output would be this:

   Site Transect Habitat Percentage
1     A        1       X       44.0
2     A        1       Y        6.0
3     A        1       Z       50.0
4     A        2       X       40.0
5     A        2       Y       23.3
6     A        2       Z       36.7
7     B        1       X       22.5
8     B        1       Y       62.5
9     B        1       Z       15.0
10    B        2       X       26.7
11    B        2       Y       26.7
12    B        2       Z       46.7

Does anyone know how to do it?

Thanks in advance!


Solution

  • With dplyr, when you have different levels of hierarchy that need managing, you may need multiple group_by() statements. In the code below, I use group_by(Site, Transect, Habitat) to calculate the total length of each habitat in the Site and Transect and then group_by(Site, Transect) to calculate the percentage.

    library(dplyr)
    df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                     Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
                     Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
                     Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
                     End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
    
    df %>% 
      mutate(length = End-Start) %>% 
      group_by(Site, Transect, Habitat) %>% 
      summarise(tot_length = sum(length)) %>% 
      group_by(Site, Transect) %>% 
      mutate(percentage = 100*tot_length/sum(tot_length))
    #> `summarise()` has grouped output by 'Site', 'Transect'. You can override using
    #> the `.groups` argument.
    #> # A tibble: 12 × 5
    #> # Groups:   Site, Transect [4]
    #>    Site  Transect Habitat tot_length percentage
    #>    <chr>    <dbl> <chr>        <dbl>      <dbl>
    #>  1 A            1 X              4.4       44  
    #>  2 A            1 Y              0.6        6  
    #>  3 A            1 Z              5         50  
    #>  4 A            2 X              6         40  
    #>  5 A            2 Y              3.5       23.3
    #>  6 A            2 Z              5.5       36.7
    #>  7 B            1 X              4.5       22.5
    #>  8 B            1 Y             12.5       62.5
    #>  9 B            1 Z              3         15  
    #> 10 B            2 X              4         26.7
    #> 11 B            2 Y              4         26.7
    #> 12 B            2 Z              7         46.7
    

    Created on 2023-02-16 by the reprex package (v2.0.1)

    In your code from above, when you are calculating the percentage, your data are still grouped by Habitat, so the percentage you are calculating is within the Habitat rather than across habitats within Site and Transect pairs.