I have a data frame that looks like this:
Issuer.Name Issue.Date
name1 01/12/2021
name2 05/04/2022
name2 21/10/2021
name3 08/09/2020
name4 30/08/2023
name4 12/05/2021
name4 18/10/2022
name5 01/12/2021
I want to create a new "Y/N" factor variable that flags, within each Issuer.Name group, whether a row holds the first Issue.Date for that issuer. The result should look like this:
Issuer.Name Issue.Date First.Issue.Date
name1 01/12/2021 Y
name2 05/04/2022 N
name2 21/10/2021 Y
name3 08/09/2020 Y
name4 30/08/2023 N
name4 12/05/2021 Y
name4 18/10/2022 N
name5 01/12/2021 Y
I used the following code and it worked fine, but I suspect there is a more concise way to do it:
df <- df %>%
  group_by(Issuer.Name) %>%
  arrange(Issue.Date) %>%
  mutate(First.Issue.Date = Issue.Date[1]) %>%
  mutate(First.Issue = case_when(Issue.Date == First.Issue.Date ~ "Y", .default = "N"))
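The issue is that you're not looking for the first date; you're looking for the earliest date. If your data were sorted by date, these would be the same thing, but it isn't, so they're not. Taking the minimum per group avoids the sort altogether. Either of the following works, provided Issue.Date is stored as a Date (as in the data at the end) rather than as dd/mm/yyyy text, which min() would compare alphabetically. The first returns TRUE/FALSE, the second "Y"/"N":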
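# Logical flag: TRUE for the earliest date per issuer (.by needs dplyr >= 1.1.0)
dplyr::mutate(df, f = Issue.Date == min(Issue.Date), .by = Issuer.Name)
# Same test, recoded as "Y"/"N" (a character column, not yet a factor)
dplyr::mutate(df, `Y/N` = ifelse(Issue.Date == min(Issue.Date), "Y", "N"), .by = Issuer.Name)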
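Since the question asks for a factor rather than a character column, the comparison can be wrapped in factor(). This is only a minimal sketch, assuming dplyr 1.1.0 or later (needed for .by) and that Issue.Date is already a Date. The column name First.Issue is borrowed from the question's code, and result is just an illustrative name:

```r
library(dplyr)

# Sample data from the question, with Issue.Date parsed as a Date.
df <- data.frame(
  Issuer.Name = c("name1", "name2", "name2", "name3", "name4", "name4", "name4", "name5"),
  Issue.Date = as.Date(
    c("01/12/2021", "05/04/2022", "21/10/2021", "08/09/2020",
      "30/08/2023", "12/05/2021", "18/10/2022", "01/12/2021"),
    format = "%d/%m/%Y"
  )
)

# Flag the earliest Issue.Date within each issuer and store the flag as a
# factor with fixed levels, so both "Y" and "N" are levels in every group.
result <- mutate(
  df,
  First.Issue = factor(
    ifelse(Issue.Date == min(Issue.Date), "Y", "N"),
    levels = c("Y", "N")
  ),
  .by = Issuer.Name
)

result
```

The row order of df is left untouched, so no arrange() step is needed.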
Notes:
Generally speaking, storing true/false values as anything other than TRUE or FALSE is frowned upon: it takes up more space, is more of a pain to work with, and can lead to strange bugs (e.g. if you forget the convention later and start using lowercase "y" and "n", or "yes" and "no", then nothing will match).
Also, while you can use slashes in variable names, doing so can lead to bugs: without the backticks, R will interpret Y/N as a variable Y divided by a variable N.
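To illustrate the backtick point, here is a small sketch using a hypothetical data frame (demo and its numeric columns Y and N are made up for the demonstration). It shows how the same characters mean different things with and without backticks:

```r
# Hypothetical data: two numeric columns named Y and N, plus a column
# whose name literally contains a slash.
demo <- data.frame(Y = c(10, 20), N = c(2, 5))
demo[["Y/N"]] <- c("Y", "N")

# With backticks, R looks up the single column named "Y/N".
with_backticks <- with(demo, `Y/N`)

# Without backticks, R parses Y/N as column Y divided by column N.
without_backticks <- with(demo, Y / N)

with_backticks     # a character vector
without_backticks  # a numeric vector
```

A name such as First.Issue avoids the ambiguity entirely.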
When using group_by(), it's a good idea to ungroup() at the end (or better yet, use .by as I have, as it saves having to remember to do that). I personally have seen at least a half dozen people in the past few months with questions which ultimately came back to them forgetting to ungroup(), and then being confused by why their code wasn't giving the correct results.
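As a sketch of the kind of surprise a forgotten ungroup() can cause (the column names First, n_rows_grouped and n_rows_by are made up for this illustration, and the data is a shortened version of the sample): after group_by(), later verbs keep operating per group, whereas .by applies to a single call only.

```r
library(dplyr)

df <- data.frame(
  Issuer.Name = c("name1", "name2", "name2", "name3"),
  Issue.Date = as.Date(c("2021-12-01", "2022-04-05", "2021-10-21", "2020-09-08"))
)

# group_by() without ungroup(): the grouping sticks to the result ...
grouped <- df %>%
  group_by(Issuer.Name) %>%
  mutate(First = Issue.Date == min(Issue.Date))

# ... so a later n() counts rows per issuer, not rows in the whole table.
grouped_counts <- mutate(grouped, n_rows_grouped = n())

# .by groups only for this one call and returns an ungrouped data frame.
by_version <- mutate(df, First = Issue.Date == min(Issue.Date), .by = Issuer.Name)
by_counts <- mutate(by_version, n_rows_by = n())

is_grouped_df(grouped)     # TRUE
is_grouped_df(by_version)  # FALSE
```

Here n_rows_grouped holds the per-issuer row counts, while n_rows_by holds the total number of rows in every row.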
Data:
df <- data.frame(
  Issuer.Name = c("name1", "name2", "name2", "name3", "name4", "name4", "name4", "name5"),
  Issue.Date = as.Date(c("01/12/2021", "05/04/2022", "21/10/2021", "08/09/2020", "30/08/2023", "12/05/2021", "18/10/2022", "01/12/2021"), format = "%d/%m/%Y")
)