Unexpected dply() output. Not grouping as desired

I am an R novice, using to attempt to analyze some demographic data for a plant species. My dataframe consists of:

TagKey (unique identifier), Year (observation year), TagEstablished (year the plant was first found), and StageClass (0=dead, 1=seedling, 2=vegetative, 3=reproductive). There is a row for each year the plant was visited, but I want 1 row per plant, then columns for its status each year. This is in order to track an individual's status from year to year.

example data:

TagKey <- c(PDPLM040J0_ALIFOR01_Belt_0, PDPLM040J0_ALIFOR01_Belt_0, PDPLM040J0_ALIFOR01_Belt_0, PDPLM040J0_ALIFOR01_Belt_1, PDPLM040J0_ALIFOR01_Belt_1, PDPLM040J0_ALIFOR01_Belt_1)
Year <- c(2020, 2020, 2020, 2021, 2021, 2021)
TagEstablished <- c(2020, 2020, 2020, 2020, 2020, 2020)
StageClass <- c(1, 2, 3, 0, 3, 3)

ALFO_stages <- data.frame(TagKey, Year, TagEstablished, StageClass)

I tried using ddply:

ALFO_status <- ddply(ALFO_stages, .(TagKey), dplyr::summarize,
                     Year_Established = TagEstablished,
                     Status2020 = if(Year=="2020") {StageClass},
                     Status2021 = if(Year=="2021") {StageClass})

My output does not group by TagKey as desired. The outputs are correct for their respective years, but the nonapplicable years just spit out NAs. Help?

Solution

Based on this sentence: "There is a row for each year the plant was visited, but I want 1 row per plant, then columns for its status each year." It sounds like what you want is to reshape or pivot your data.

'Group by' tends to be part of summarizing data. E.g. count the number of records per year, would involve grouping by the year. Pivoting or reshaping is the process of column contents to column labels or visa-versa.

In R, I would recommend the tidyr package. Perhaps something like:

TagKey <- c("PDPLM040J0_ALIFOR01_Belt_0", "PDPLM040J0_ALIFOR01_Belt_0", "PDPLM040J0_ALIFOR01_Belt_0", "PDPLM040J0_ALIFOR01_Belt_1", "PDPLM040J0_ALIFOR01_Belt_1", "PDPLM040J0_ALIFOR01_Belt_1")
Year <- c(2018, 2019, 2020, 2019, 2020, 2021) # NOTE editted for unique year for each tree
TagEstablished <- c(2020, 2020, 2020, 2020, 2020, 2020)
StageClass <- c(1, 2, 3, 0, 3, 3)

ALFO_stages <- data.frame(TagKey, Year, TagEstablished, StageClass)

library(tidyr)
library(dplyr)

ALFO_stages %>% pivot_wider(id_cols = c(TagKey, TagEstablished), names_from = Year, values_from = StageClass)

This produces:

   TagKey                     TagEstablished `2018` `2019` `2020` `2021`
  <chr>                               <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 PDPLM040J0_ALIFOR01_Belt_0           2020      1      2      3     NA
2 PDPLM040J0_ALIFOR01_Belt_1           2020     NA      0      3      3

Alternatively, you could do this manually using lots of ifelse statements:

ALFO_stages %>%
  group_by(TagKey, TagEstablished) %>%
  summarise(y2018 = max(ifelse(Year == 2018, StageClass, NA), na.rm = TRUE),
            y2019 = max(ifelse(Year == 2019, StageClass, NA), na.rm = TRUE),
            y2020 = max(ifelse(Year == 2020, StageClass, NA), na.rm = TRUE),
            y2021 = max(ifelse(Year == 2021, StageClass, NA), na.rm = TRUE))

These two pieces of code produce equivalent answers (but with different NA handling of missing values).