I have a dataframe (df
) in which each row represents the start (Start
) and the end (End
) of a specific habitat (Habitat
) within a transect (Transect
) and site (Site
) in meters. It is important to note that the length of the transects varies within and among sites. As an example:
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
df
Site Transect Habitat Start End
1 A 1 X 0.0 2.8 # Habitat `X` is between the meters 0 and 2.8
2 A 1 Y 2.8 3.4 # Habitat `Y` is between the meters 2.8 and 3.4
3 A 1 X 3.4 5.0 # Habitat `X` is between the meters 3.4 and 5.0
4 A 1 Z 5.0 10.0 # Habitat `Z` is between the meters 5 and 10.0
5 A 2 Z 0.0 1.5
6 A 2 Y 1.5 5.0
7 A 2 X 5.0 8.0
8 A 2 Z 8.0 12.0
9 A 2 X 12.0 15.0
10 B 1 X 0.0 2.0
11 B 1 Z 2.0 5.0
12 B 1 X 5.0 7.5
13 B 1 Y 7.5 20.0
14 B 2 Z 0.0 4.0
15 B 2 X 4.0 8.0
16 B 2 Y 8.0 12.0
17 B 2 Z 12.0 15.0
In this example, for instance, habitat X
is twice in the transect 1
in site A
. Also, we can observe that the total length of transects 1
and 2
in site A
are 10 and 15 m, respectively. In site B
, the total length of the transects 1
and 2
are 20
and 15
meters, respectively.
What I want is to calculate per Site
and Transect
the percentage that each Habitat
represents with respect to all the habitats presented in terms of meters. For example, in transect 1
and site A
habitat X
represents 4.4 meters of a total length of 10 meters for transect 1
. In site A
and transect 2
, habitat X
has 6 meters from a total length of 15 meters for transect B
.
To this aim, the first thing I do is to calculate the length (Length
) in meters of each habitat record (=row)
df$Length <- df$End - df$Start
Then, what I want is to calculate by site and transect the percentage that the meters of an habitat represents with respect the rest of habitats and the total length of the transect. I tried this:
df2 <- as.data.frame(df %>% group_by(Site, Transect, Habitat) %>% summarise(Porcentage = (sum(Length)/max(End))*100))
I want to change max(End)
to another expression that represents the total length OF THE TRANSECT. Right now max(End)
represents the last meter (End
) in which a specific habitat was present. How can I include in the code above "maximum value of End
" but within of a specific Site
and Transect
, but not for a specific Habitat
.
How can I do it? My desired output would be this:
Site Transect Habitat Percentage
1 A 1 X 44.0
2 A 1 Y 6.0
3 A 1 Z 50.0
4 A 2 X 40.0
5 A 2 Y 23.3
6 A 2 Z 36.7
7 B 1 X 22.5
8 B 1 Y 62.5
9 B 1 Z 15.0
10 B 2 X 26.7
11 B 2 Y 26.7
12 B 2 Z 46.7
Does anyone know how to do it?
Thanks in advance!
With dplyr
, when you have different levels of hierarchy that need managing, you may need multiple group_by()
statements. In the code below, I use group_by(Site, Transect, Habitat)
to calculate the total length of each habitat in the Site
and Transect
and then group_by(Site, Transect)
to calculate the percentage.
library(dplyr)
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
df %>%
mutate(length = End-Start) %>%
group_by(Site, Transect, Habitat) %>%
summarise(tot_length = sum(length)) %>%
group_by(Site, Transect) %>%
mutate(percentage = 100*tot_length/sum(tot_length))
#> `summarise()` has grouped output by 'Site', 'Transect'. You can override using
#> the `.groups` argument.
#> # A tibble: 12 × 5
#> # Groups: Site, Transect [4]
#> Site Transect Habitat tot_length percentage
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 A 1 X 4.4 44
#> 2 A 1 Y 0.6 6
#> 3 A 1 Z 5 50
#> 4 A 2 X 6 40
#> 5 A 2 Y 3.5 23.3
#> 6 A 2 Z 5.5 36.7
#> 7 B 1 X 4.5 22.5
#> 8 B 1 Y 12.5 62.5
#> 9 B 1 Z 3 15
#> 10 B 2 X 4 26.7
#> 11 B 2 Y 4 26.7
#> 12 B 2 Z 7 46.7
Created on 2023-02-16 by the reprex package (v2.0.1)
In your code from above, when you are calculating the percentage, your data are still grouped by Habitat
, so the percentage you are calculating is within the Habitat
rather than across habitats within Site
and Transect
pairs.