I have two files, one is full of keywords (roughly 2,000 rows) and the other is full of text (roughly 770,000 rows). The keyword file looks like:
Event Name            Keyword
All-day tabby fest    tabby, all-day
All-day tabby fest    tabby, fest
Maine Coon Grooming   maine coon, groom
Maine Coon Grooming   coon, groom
keywordFile <- tibble(EventName = c("All-day tabby fest", "All-day tabby fest", "Maine Coon Grooming", "Maine Coon Grooming"), Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"))
The text file looks like:
Description
Bring your tabby to the fest on Tuesday
All cats are welcome to the fest on Tuesday
Mainecoon grooming will happen at noon Wednesday
Maine coons will be pampered at noon on Wednesday
text <- tibble(Description = c("Bring your tabby to the fest on Tuesday", "All cats are welcome to the fest on Tuesday", "Mainecoon grooming will happen at noon Wednesday", "Maine coons will be pampered at noon on Wednesday"))
What I want is to iterate through the text file and look for fuzzy matches (a description must include every word in the "Keyword" column) and return a new column that displays TRUE or FALSE. If that is TRUE, then I want a third column to display the event name. So something that looks like:
Description                                         Match?  Event Name
Bring your tabby to the fest on Tuesday             TRUE    All-day tabby fest
All cats are welcome to the fest on Tuesday         FALSE
Mainecoon grooming will happen at noon Wednesday    TRUE    Maine Coon Grooming
Maine coons will be pampered at noon on Wednesday   FALSE
I am able to successfully do my fuzzy matches (after converting everything to lowercase) with stuff like this, thanks to Molx (How can I check if multiple strings exist in another string?):
str <- c("tabby", "all-day")
myStr <- "Bring your tabby to the fest on Tuesday"
all(sapply(str, grepl, myStr))
However, I am getting stuck when I try to fuzzy match the whole files. I tried something like this:
for (i in seq_along(text$Description)){
for (j in seq_along(keywordFile$EventName)) {
# below I am creating the TRUE/FALSE column
text$TF[i] <- all(sapply(keywordFile$Keyword[j], grepl,
text$Description[i]))
if (isTRUE(text$TF))
# below I am creating the EventName column
text$EventName <- keywordFile$EventName
}
}
I don't think I'm having trouble converting the right things to vectors and strings. My keywordFile$Keyword column is a bunch of string vectors and my text$Description column is a character string. But I'm struggling with how to iterate properly through both files. The error I'm getting is
Error in ... replacement has 13 rows, data has 1
Has anyone done anything like this before?
I'm not completely sure I get your question, as I wouldn't call grepl() fuzzy matching. Rather, it catches the keyword whenever it appears inside a longer word. So "cat" and "catastrophe" would be a match even though these words are very different.
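To make the contrast concrete, here is a small base-R sketch using the two words from the sentence above. `adist()` (a generalized Levenshtein distance from base R) is used here only as a stand-in for an edit-distance measure; it is not part of the approach shown further down.

```r
# Substring matching: "cat" is found inside "catastrophe"
substring_hit <- grepl("cat", "catastrophe", fixed = TRUE)

# Edit distance: number of single-character insertions, deletions or
# substitutions needed to turn one word into the other
edit_distance <- as.vector(adist("cat", "catastrophe"))

substring_hit   # TRUE, although the words are unrelated
edit_distance   # large, so a distance-based match would reject the pair
```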
I chose instead to write an answer where you can control the distance between strings that still constitutes a match:
Load libraries:
library(tibble)
library(dplyr)
library(fuzzyjoin)
library(tidytext)
library(tidyr)
Make dictionary and data object:
# Dictionary: one row per (event, single keyword)
dict <- tibble(
  Event_Name = c(
    "All-day tabby fest",
    "All-day tabby fest",
    "Maine Coon Grooming",
    "Maine Coon Grooming"
  ),
  Keyword = c(
    "tabby, all-day",
    "tabby, fest",
    "maine coon, groom",
    "coon, groom"
  )
) %>%
  mutate(Keyword = strsplit(Keyword, ", ")) %>%  # split "a, b" into c("a", "b")
  unnest(Keyword)                                # one keyword per row

# Text data, with an id so rows can be reassembled later
string <- tibble(id = 1:4, Description = c(
  "Bring your tabby to the fest on Tuesday",
  "All cats are welcome to the fest on Tuesday",
  "Mainecoon grooming will happen at noon Wednesday",
  "Maine coons will be pampered at noon on Wednesday"
))
Apply dictionary to data:
string_annotated <- string %>%
  # one lowercase word per row
  unnest_tokens(output = "word", input = Description) %>%
  # join each word to every keyword within string distance 1
  stringdist_left_join(y = dict, by = c("word" = "Keyword"), max_dist = 1) %>%
  mutate(match = !is.na(Keyword))
> string_annotated
# A tibble: 34 x 5
      id word    Event_Name         Keyword match
   <int> <chr>   <chr>              <chr>   <lgl>
 1     1 bring   NA                 NA      FALSE
 2     1 your    NA                 NA      FALSE
 3     1 tabby   All-day tabby fest tabby   TRUE
 4     1 tabby   All-day tabby fest tabby   TRUE
 5     1 to      NA                 NA      FALSE
 6     1 the     NA                 NA      FALSE
 7     1 fest    All-day tabby fest fest    TRUE
 8     1 on      NA                 NA      FALSE
 9     1 tuesday NA                 NA      FALSE
10     2 all     NA                 NA      FALSE
# ... with 24 more rows
max_dist controls how different two strings may be and still count as a match. With a maximum string distance of 1, every text in this example gets at least one match, but I also tried it with a string that has no match.
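To get a feel for what a distance of 1 lets through, here is a rough sketch that computes edit distances for a few word/keyword pairs from the example. It uses base R's `adist()` rather than the stringdist package that fuzzyjoin relies on, so the assumption is that both give the same values for these simple pairs; the `pairs` data frame is made up for illustration.

```r
# Word/keyword pairs taken from the example texts and dictionary
pairs <- data.frame(
  word    = c("mainecoon", "coons", "noon", "grooming"),
  keyword = c("maine coon", "coon", "coon", "groom"),
  stringsAsFactors = FALSE
)

# Edit distance for each pair (adist() is a stand-in, see above)
pairs$distance <- mapply(
  function(w, k) as.vector(adist(w, k)),
  pairs$word, pairs$keyword,
  USE.NAMES = FALSE
)

# Which pairs would survive max_dist = 1?
pairs$within_max_dist_1 <- pairs$distance <= 1
pairs
```

Note that "noon" is within distance 1 of "coon", so a loose threshold can produce false positives, while "grooming" is too far from "groom" to match at this setting.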
If you want to collapse this long format back into one row per description, as in the original data:
string_annotated_col <- string_annotated %>%
  group_by(id) %>%
  summarise(
    Description = paste(word, collapse = " "),           # rebuild the text
    match       = sum(match),                            # number of matched rows
    keywords    = toString(unique(na.omit(Keyword))),    # matched keywords
    Event_Name  = toString(unique(na.omit(Event_Name)))  # matched events
  )
> string_annotated_col
# A tibble: 4 x 5
     id Description                                       match keywords         Event_Name
  <int> <chr>                                             <int> <chr>            <chr>
1     1 bring your tabby tabby to the fest on tuesday         3 tabby, fest      All-day tabby fest
2     2 all cats are welcome to the fest on tuesday           1 fest             All-day tabby fest
3     3 mainecoon grooming will happen at noon wednesday      2 maine coon, coon Maine Coon Grooming
4     4 maine coons will be pampered at noon on wednesday     2 coon             Maine Coon Grooming
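The summary above counts individual keyword hits, whereas the question asks for TRUE only when every keyword of a set is present. A minimal sketch of that stricter check follows. It is plain base R with substring matching via `grepl()`, so it is the question's original approach rather than the fuzzy join, and `first_full_match()` is a hypothetical helper name. On the full 770,000 x 2,000 data this nested loop may be slow.

```r
# Sample data copied from the question
keywords <- data.frame(
  EventName = c("All-day tabby fest", "All-day tabby fest",
                "Maine Coon Grooming", "Maine Coon Grooming"),
  Keyword   = c("tabby, all-day", "tabby, fest",
                "maine coon, groom", "coon, groom"),
  stringsAsFactors = FALSE
)
descriptions <- c(
  "Bring your tabby to the fest on Tuesday",
  "All cats are welcome to the fest on Tuesday",
  "Mainecoon grooming will happen at noon Wednesday",
  "Maine coons will be pampered at noon on Wednesday"
)

# One character vector of lowercase terms per keyword row
keyword_sets <- strsplit(tolower(keywords$Keyword), ",\\s*")

# Hypothetical helper: event name of the first keyword row whose terms
# all occur (as substrings) in the description, or NA if none does
first_full_match <- function(description) {
  description <- tolower(description)
  hit <- vapply(keyword_sets, function(terms) {
    all(vapply(terms, grepl, logical(1), x = description, fixed = TRUE))
  }, logical(1))
  if (any(hit)) keywords$EventName[which(hit)[1]] else NA_character_
}

event <- vapply(descriptions, first_full_match, character(1), USE.NAMES = FALSE)
result <- data.frame(
  Description = descriptions,
  Match       = !is.na(event),
  EventName   = event,
  stringsAsFactors = FALSE
)
result
```

The key difference from the loop in the question is that each keyword string is split into a vector of terms first, so `all()` really tests every term instead of one comma-separated string.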
Feel free to ask questions if a part of the answer doesn't make sense to you. Some of it is explained in here, except for the fuzzy matching part.