Now I have two data, name and text, and I want to calculate the number of occurrences of each name in name in the current year in text, that is, to generate data result. How to achieve this? I tried lapply and grepl, but both failed. Thanks a lot!
name=data.table(year=c(2018,2019,2020),
name0=list(c("A","B","C"),c("B","C"),c("D","E","F")))
text=data.table(year=c(2018,2018,2019,2019,2020),
text0=list(c("DEF","BG","CG"),c("ART","CWW"),c("DLK","BU","FO"),
c("A45","11B","C23"),c("EIU","CM")))
result=data.table(year=c(2018,2018,2018,2019,2019,2020,2020,2020),
name0=c("A","B","C","B","C","D","E","F"),
count=c(1,1,2,2,1,0,1,0))
A merge on unlisted values will work:
library(data.table)
merge(
name[, .(name0 = unlist(name0)), by = .(year)],
text[, .(name0 = unlist(strsplit(unlist(text0), ""))), by=.(year)][, ign := 1],
by = c("year", "name0"), all.x = TRUE, allow.cartesian = TRUE
)[,.(count = sum(!is.na(ign))), by = .(year, name0)]
# year name0 count
# <num> <char> <int>
# 1: 2018 A 1
# 2: 2018 B 1
# 3: 2018 C 2
# 4: 2019 B 2
# 5: 2019 C 1
# 6: 2020 D 0
# 7: 2020 E 1
# 8: 2020 F 0
The ign
variable is so that we can force all.x=TRUE
yet account for those that were not found in y
.
Slower but perhaps more memory-frugal method:
namelong <- name[, .(name0 = unlist(name0)), by = .(year)]
namelong
# year name0
# <num> <char>
# 1: 2018 A
# 2: 2018 B
# 3: 2018 C
# 4: 2019 B
# 5: 2019 C
# 6: 2020 D
# 7: 2020 E
# 8: 2020 F
func <- function(yr, nm) text[year == yr, sum(grepl(nm, unlist(text0)))]
namelong[, count := do.call(mapply, c(list(FUN=func), unname(namelong)))]
# year name0 count
# <num> <char> <int>
# 1: 2018 A 1
# 2: 2018 B 1
# 3: 2018 C 2
# 4: 2019 B 2
# 5: 2019 C 1
# 6: 2020 D 0
# 7: 2020 E 1
# 8: 2020 F 0