
R: Find the earliest date of a group and create a Y/N factor variable indicating whether it is the first date of the group


I have a dataframe that looks like this:

Issuer.Name Issue.Date    
name1 01/12/2021    
name2 05/04/2022    
name2 21/10/2021    
name3 08/09/2020    
name4 30/08/2023    
name4 12/05/2021    
name4 18/10/2022    
name5 01/12/2021    

I want to create a new "Y/N" factor variable that indicates, for each Issuer.Name, whether a row holds the first Issue.Date of that group. It should return something like this:

Issuer.Name Issue.Date First.Issue.Date    
name1 01/12/2021 Y    
name2 05/04/2022 N    
name2 21/10/2021 Y    
name3 08/09/2020 Y    
name4 30/08/2023 N    
name4 12/05/2021 Y    
name4 18/10/2022 N    
name5 01/12/2021 Y    

I used this command and it worked fine, but I think it should be possible to do something more concise:

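library(dplyr)

df <- df %>%
        group_by(Issuer.Name) %>%
        arrange(Issue.Date) %>%
        # after sorting, the first date within each group
        mutate(First.Issue.Date = Issue.Date[1]) %>%
        # flag rows whose date equals that first date
        mutate(First.Issue = case_when(Issue.Date == First.Issue.Date ~ "Y", .default = "N"))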

Solution

  • The issue is that you're not looking for the first date, you're looking for the earliest date. If your data were sorted by date, these would be the same thing, but it isn't, so they're not.

    dplyr::mutate(df, f = Issue.Date == min(Issue.Date), .by = Issuer.Name)
    
    dplyr::mutate(df, `Y/N` = ifelse(Issue.Date == min(Issue.Date), "Y","N"), .by = Issuer.Name)
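
Since the question asks for a factor rather than a character column, one possible variant wraps the flag in factor() with explicit levels. This is only a sketch: it assumes dplyr 1.1.0 or later (for `.by`) and that Issue.Date has already been parsed as a Date. The column name First.Issue comes from the question, and the object name `out` is purely illustrative.

```r
library(dplyr)

# Same sample data as in the question, with dates parsed as Date objects
df <- data.frame(
  Issuer.Name = c("name1", "name2", "name2", "name3", "name4", "name4", "name4", "name5"),
  Issue.Date = as.Date(c("01/12/2021", "05/04/2022", "21/10/2021", "08/09/2020",
                         "30/08/2023", "12/05/2021", "18/10/2022", "01/12/2021"),
                       format = "%d/%m/%Y"))

# Flag the earliest date per issuer and store the flag as a two-level factor.
# Fixing the levels up front keeps them identical in every group.
out <- mutate(
  df,
  First.Issue = factor(ifelse(Issue.Date == min(Issue.Date), "Y", "N"),
                       levels = c("Y", "N")),
  .by = Issuer.Name
)

as.character(out$First.Issue)
# "Y" "N" "Y" "Y" "N" "Y" "N" "Y"
```

Because `.by` keeps the original row order, the result lines up row for row with the expected output in the question.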
    

    Notes:

    1. Generally speaking, storing true/false values as anything other than TRUE or FALSE is frowned upon: it takes up more space, is more of a pain to work with, and can lead to strange bugs (e.g. if you later forget and start using lowercase "y" and "n", or "yes" and "no", nothing will match).

    2. Also, although you can use slashes in variable names, doing so can lead to bugs: without the backticks, R will interpret Y/N as a variable Y divided by a variable N.

    3. When using group_by(), it's a good idea to ungroup() at the end (or better yet, use .by as I have, as it saves having to remember to do that). I have personally seen at least half a dozen people in the past few months with questions that ultimately came down to forgetting to ungroup() and then being confused about why their code wasn't giving the correct results.

    Data:

    df <- data.frame(
      Issuer.Name = c("name1", "name2", "name2", "name3", "name4", "name4", "name4", "name5"),
      Issue.Date = as.Date(c("01/12/2021", "05/04/2022", "21/10/2021", "08/09/2020", "30/08/2023", "12/05/2021", "18/10/2022", "01/12/2021"), format="%d/%m/%Y"))
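
To make note 3 concrete, the sketch below compares the grouping state left behind by group_by() with that left by `.by`. It assumes dplyr 1.1.0 or later and uses a shortened copy of the data; the names `grouped`, `by_arg`, and the column `first` are illustrative and not part of the original answer.

```r
library(dplyr)

# Shortened sample: two issuers, three rows
df <- data.frame(
  Issuer.Name = c("name1", "name2", "name2"),
  Issue.Date = as.Date(c("01/12/2021", "05/04/2022", "21/10/2021"), format = "%d/%m/%Y"))

# group_by() + mutate(): the result is still grouped by Issuer.Name
grouped <- df %>%
  group_by(Issuer.Name) %>%
  mutate(first = Issue.Date == min(Issue.Date))

# .by: the grouping applies to this one mutate() call only
by_arg <- mutate(df, first = Issue.Date == min(Issue.Date), .by = Issuer.Name)

is_grouped_df(grouped)           # TRUE
is_grouped_df(by_arg)            # FALSE
is_grouped_df(ungroup(grouped))  # FALSE

# A later summary silently runs per issuer on the grouped result
nrow(summarise(grouped, n = n()))  # 2, one row per issuer
nrow(summarise(by_arg, n = n()))   # 1, one row for the whole table
```

Both approaches produce the same `first` column; the difference only shows up in whatever is computed afterwards.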