Search code examples
rdata.tablelapplystringrgrepl

How to count the number multiple strings appear in another string list by group?


Now I have two data, name and text, and I want to calculate the number of occurrences of each name in name in the current year in text, that is, to generate data result. How to achieve this? I tried lapply and grepl, but both failed. Thanks a lot!

name=data.table(year=c(2018,2019,2020),
                  name0=list(c("A","B","C"),c("B","C"),c("D","E","F")))
text=data.table(year=c(2018,2018,2019,2019,2020),
                text0=list(c("DEF","BG","CG"),c("ART","CWW"),c("DLK","BU","FO"),
                           c("A45","11B","C23"),c("EIU","CM")))
result=data.table(year=c(2018,2018,2018,2019,2019,2020,2020,2020),
                 name0=c("A","B","C","B","C","D","E","F"),
                 count=c(1,1,2,2,1,0,1,0))

Solution

  • A merge on unlisted values will work:

    library(data.table)
    merge(
      name[, .(name0 = unlist(name0)), by = .(year)],
      text[, .(name0 = unlist(strsplit(unlist(text0), ""))), by=.(year)][, ign := 1],
      by = c("year", "name0"), all.x = TRUE, allow.cartesian = TRUE
    )[,.(count = sum(!is.na(ign))), by = .(year, name0)]
    #     year  name0 count
    #    <num> <char> <int>
    # 1:  2018      A     1
    # 2:  2018      B     1
    # 3:  2018      C     2
    # 4:  2019      B     2
    # 5:  2019      C     1
    # 6:  2020      D     0
    # 7:  2020      E     1
    # 8:  2020      F     0
    

    The ign variable is so that we can force all.x=TRUE yet account for those that were not found in y.


    Slower but perhaps more memory-frugal method:

    namelong <- name[, .(name0 = unlist(name0)), by = .(year)]
    namelong
    #     year  name0
    #    <num> <char>
    # 1:  2018      A
    # 2:  2018      B
    # 3:  2018      C
    # 4:  2019      B
    # 5:  2019      C
    # 6:  2020      D
    # 7:  2020      E
    # 8:  2020      F
    
    func <- function(yr, nm) text[year == yr, sum(grepl(nm, unlist(text0)))]
    namelong[, count := do.call(mapply, c(list(FUN=func), unname(namelong)))]
    #     year  name0 count
    #    <num> <char> <int>
    # 1:  2018      A     1
    # 2:  2018      B     1
    # 3:  2018      C     2
    # 4:  2019      B     2
    # 5:  2019      C     1
    # 6:  2020      D     0
    # 7:  2020      E     1
    # 8:  2020      F     0