Tags: r, string, sum, icd

How to make subgroups by prefixes from ICD data?


I have a large ICD-10 dataset and I would like to group the codes by prefix and compute a sum for each group.

For example, I have the codes 'JAL01', 'JAL20' and 'JAL21', and I need the sum over all codes starting with 'JAL'. How do I do that?


Solution

  • Extract the first 3 characters of each code with substr(), then group by that prefix and sum:

    # example data
    df1 <- data.frame(icd = c("JAL01", "JAL20", "JAL21", "foo11", "foo22"),
                      x = 1:5)
    
    # get 1st 3 letters
    df1$grp <- substr(df1$icd, 1, 3)
    
    # get sum per group
    aggregate(x ~ grp, df1, sum)
    #   grp x
    # 1 foo 9
    # 2 JAL 6
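
If only a single prefix is needed, or the prefixes are not all three characters long, something along these lines might work as well. This is only a sketch in base R: it reuses the made-up example data from above, and the regular expression assumes every code consists of leading letters followed by digits, which may not hold for your real data.

```r
# same made-up example data as above
df1 <- data.frame(icd = c("JAL01", "JAL20", "JAL21", "foo11", "foo22"),
                  x = 1:5)

# as.character() guards against icd being a factor (the default before R 4.0)
codes <- as.character(df1$icd)

# sum for one specific prefix only (startsWith() needs R >= 3.3.0)
jal_total <- sum(df1$x[startsWith(codes, "JAL")])
jal_total
# [1] 6

# variable-length prefix: keep only the leading letters of each code
# (assumption: codes look like letters followed by digits)
df1$grp <- sub("^([A-Za-z]+).*$", "\\1", codes)

# named vector of sums, one entry per prefix
totals <- tapply(df1$x, df1$grp, sum)
```

Here totals[["JAL"]] gives the sum for the 'JAL' codes. tapply() returns a named vector rather than a data frame, so stick with aggregate() if you need the result as a table.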