The goal of this question is how to create a column with values that depend on the maximum value of other columns that can be grouped.
I have a dataset with the tree species proportions of some plots for 4 years. I would like to extract which is the dominant species in each plot at each year.
It is possible to create a test dataset by running these lines:
jobFun <- function(n) {
m <- matrix(runif(3*n,0,1), ncol=3)
m<- sweep(m, 1, rowSums(m), FUN="/")
m
}
year1 = data.frame(jobFun(10))
colnames(year1) = paste0("Year1_", c("Spruce", "Pine", "Birch"))
year2 = data.frame(jobFun(10))
colnames(year2) = paste0("Year2_", c("Spruce", "Pine", "Birch"))
year3 = data.frame(jobFun(10))
colnames(year3) = paste0("Year3_", c("Spruce", "Pine", "Birch"))
year4 = data.frame(jobFun(10))
colnames(year4) = paste0("Year4_", c("Spruce", "Pine", "Birch"))
df = cbind(year1, year2, year3, year4)
I can make a loop for each of the years and compare the values, but is not very clean and I would like to learn more about mutate
commands and dplyr
. For example, for year 1:
# Dominant Species in year 1
for(i in 1:nrow(df)){
if(which.max(select(df[i,],c("Year1_Spruce","Year1_Pine","Year1_Birch"))) == 1){
df$dom.species.y1[i] <- "S"
} else if(which.max(select(df[i,],c("Year1_Spruce","Year1_Pine","Year1_Birch"))) == 2){
df$dom.species.y1[i] <- "P"
}else{
df$dom.species.y1[i] <- "D"
}
}
> df$dom.species.y1
[1] "S" "P" "P" "S" "S" "P" "D" "P" "D" "S"
Before creating test data set, I would set the random seed for reproducibility
set.seed(42)
The key to using dplyr
(and the rest of the tidyverse
) is first having your data in a tidy format, part of which is every variable in a separate column. It appears the three variables are year, species, and plot id (represented by each row). This can be accomplished with tidyr
.
tidyDf <- df |>
tibble::rowid_to_column("plot_id") |>
tidyr::pivot_longer(
cols = starts_with("Year"),
names_to = c("year", "species"),
names_sep = "_"
)
#> # A tibble: 120 × 4
#> plot_id year species value
#> <int> <chr> <chr> <dbl>
#> 1 1 Year1 Spruce 0.402
#> 2 1 Year1 Pine 0.201
#> 3 1 Year1 Birch 0.397
#> 4 1 Year2 Spruce 0.508
#> 5 1 Year2 Pine 0.262
#> 6 1 Year2 Birch 0.230
#> 7 1 Year3 Spruce 0.520
#> 8 1 Year3 Pine 0.0331
#> 9 1 Year3 Birch 0.447
#> 10 1 Year4 Spruce 0.351
#> # … with 110 more rows
Once it's in a tidy format, you can group by the variables of interest and summarize using the helper function first
to select where value is max for each group (-value
is tidy syntax for descending order).
domSpeciesByYear <- tidyDf |>
dplyr::group_by(plot_id, year) |>
dplyr::summarize(
dom_species = dplyr::first(species, order_by = -value),
.groups = "drop"
)
#> # A tibble: 40 × 3
#> plot_id year dom_species
#> <int> <chr> <chr>
#> 1 1 Year1 Spruce
#> 2 1 Year2 Spruce
#> 3 1 Year3 Spruce
#> 4 1 Year4 Spruce
#> 5 2 Year1 Spruce
#> 6 2 Year2 Spruce
#> 7 2 Year3 Spruce
#> 8 2 Year4 Birch
#> 9 3 Year1 Birch
#> 10 3 Year2 Birch
#> # … with 30 more rows
This essentially solves the question about which species is dominant in each plot in each year, but you can pivot back if you want it in the exact format from the question.
domSpeciesByYear |>
dplyr::mutate(dom_species = stringr::str_sub(dom_species, start = 1, end = 1)) |>
tidyr::pivot_wider(
id_cols = plot_id,
names_from = year,
names_glue = "dom.species.y{stringr::str_sub(year, start = -1)}",
values_from = dom_species
) |>
dplyr::select(-plot_id)
#> # A tibble: 10 × 4
#> dom.species.y1 dom.species.y2 dom.species.y3 dom.species.y4
#> <chr> <chr> <chr> <chr>
#> 1 S S S S
#> 2 S S S B
#> 3 B B S B
#> 4 B P B S
#> 5 S P S P
#> 6 P P P P
#> 7 P P S P
#> 8 B P S P
#> 9 S P S S
#> 10 B P B B