I am an R novice, using to attempt to analyze some demographic data for a plant species. My dataframe consists of:
TagKey (unique identifier), Year (observation year), TagEstablished (year the plant was first found), and StageClass (0=dead, 1=seedling, 2=vegetative, 3=reproductive). There is a row for each year the plant was visited, but I want 1 row per plant, then columns for its status each year. This is in order to track an individual's status from year to year.
example data:
TagKey <- c(PDPLM040J0_ALIFOR01_Belt_0, PDPLM040J0_ALIFOR01_Belt_0, PDPLM040J0_ALIFOR01_Belt_0, PDPLM040J0_ALIFOR01_Belt_1, PDPLM040J0_ALIFOR01_Belt_1, PDPLM040J0_ALIFOR01_Belt_1)
Year <- c(2020, 2020, 2020, 2021, 2021, 2021)
TagEstablished <- c(2020, 2020, 2020, 2020, 2020, 2020)
StageClass <- c(1, 2, 3, 0, 3, 3)
ALFO_stages <- data.frame(TagKey, Year, TagEstablished, StageClass)
I tried using ddply:
ALFO_status <- ddply(ALFO_stages, .(TagKey), dplyr::summarize,
Year_Established = TagEstablished,
Status2020 = if(Year=="2020") {StageClass},
Status2021 = if(Year=="2021") {StageClass})
My output does not group by TagKey as desired. The outputs are correct for their respective years, but the nonapplicable years just spit out NAs. Help?
Based on this sentence: "There is a row for each year the plant was visited, but I want 1 row per plant, then columns for its status each year." It sounds like what you want is to reshape or pivot your data.
'Group by' tends to be part of summarizing data. E.g. count the number of records per year, would involve grouping by the year. Pivoting or reshaping is the process of column contents to column labels or visa-versa.
In R, I would recommend the tidyr
package. Perhaps something like:
TagKey <- c("PDPLM040J0_ALIFOR01_Belt_0", "PDPLM040J0_ALIFOR01_Belt_0", "PDPLM040J0_ALIFOR01_Belt_0", "PDPLM040J0_ALIFOR01_Belt_1", "PDPLM040J0_ALIFOR01_Belt_1", "PDPLM040J0_ALIFOR01_Belt_1")
Year <- c(2018, 2019, 2020, 2019, 2020, 2021) # NOTE editted for unique year for each tree
TagEstablished <- c(2020, 2020, 2020, 2020, 2020, 2020)
StageClass <- c(1, 2, 3, 0, 3, 3)
ALFO_stages <- data.frame(TagKey, Year, TagEstablished, StageClass)
library(tidyr)
library(dplyr)
ALFO_stages %>% pivot_wider(id_cols = c(TagKey, TagEstablished), names_from = Year, values_from = StageClass)
This produces:
TagKey TagEstablished `2018` `2019` `2020` `2021`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 PDPLM040J0_ALIFOR01_Belt_0 2020 1 2 3 NA
2 PDPLM040J0_ALIFOR01_Belt_1 2020 NA 0 3 3
Alternatively, you could do this manually using lots of ifelse
statements:
ALFO_stages %>%
group_by(TagKey, TagEstablished) %>%
summarise(y2018 = max(ifelse(Year == 2018, StageClass, NA), na.rm = TRUE),
y2019 = max(ifelse(Year == 2019, StageClass, NA), na.rm = TRUE),
y2020 = max(ifelse(Year == 2020, StageClass, NA), na.rm = TRUE),
y2021 = max(ifelse(Year == 2021, StageClass, NA), na.rm = TRUE))
These two pieces of code produce equivalent answers (but with different NA handling of missing values).