Tags: r, dplyr, nlp

How to check a string for a list of words in a file using R


data1 in strings.xlsx has one text string per row. The column name is 'heading':
"Quick fox ran over the desk"
"Quick red fox jumped over the dog"
"Red fox crossed the road"
"Quick red dog crossed the ROAD"

data2 in keywords.xlsx has the keywords:
fox
Jump
DOG
cross
road

I want to check every keyword from data2 against the text in data1. The CSV output file should contain the 'heading' column from data1, plus one column per keyword from data2 holding 1 for a match and 0 for no match.

I have tried the following

library(readxl)
library(openxlsx)
library(tidyverse)
library(data.table)
data1 = read_excel("strings.xlsx")
data1$heading = sapply(data1$heading, tolower) #need the same for keyword.xlsx
v1 <- readxl::read_excel('keywords.xlsx') %>% pull(1)
for(v in v1){
  data1 <- data1 %>%
    mutate(!! v := as.integer(heading %like% v))
}

Solution

  • We can use map_dfc from purrr

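    library(dplyr)
    library(purrr)
    library(data.table)  # provides %like%
    v1 <- c('vitamin', 'amino')
    # build one 0/1 column per keyword, name the columns, then append them to data1
    map_dfc(v1, ~ 
            as.integer(data1[['columnname']] %like% .x)) %>%
        set_names(v1) %>%
        bind_cols(data1, .)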
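
The keywords in the question are mixed case (Jump, DOG) and so is one heading (ROAD), so both sides need lowercasing before matching. Below is a minimal sketch of that on the question's sample data. It uses base R's grepl() in place of %like% so that it runs without extra packages. The inline data frame and the names kw, headings_lower, flags, and result are illustrative assumptions, not part of the original answer.

```r
# Sample data copied from the question; in practice these come from the Excel files.
data1 <- data.frame(
  heading = c("Quick fox ran over the desk",
              "Quick red fox jumped over the dog",
              "Red fox crossed the road",
              "Quick red dog crossed the ROAD"),
  stringsAsFactors = FALSE
)
keywords <- c("fox", "Jump", "DOG", "cross", "road")

# Lowercase both sides so that "DOG" matches "dog" and "road" matches "ROAD".
kw <- tolower(keywords)
headings_lower <- tolower(data1$heading)

# One integer 0/1 vector per keyword; fixed = TRUE treats the keyword
# as a literal substring rather than a regular expression.
flags <- lapply(kw, function(k) as.integer(grepl(k, headings_lower, fixed = TRUE)))
names(flags) <- kw

# Append the keyword columns to the original (unmodified) headings.
result <- cbind(data1, as.data.frame(flags))
print(result)
```

Because the match is on substrings, jump matches "jumped" and cross matches "crossed". The original casing of heading is kept in the output, since only a lowercased copy is used for matching.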
    

    Or with a for loop:

    v1 <- c('vitamin', 'amino')
    for(v in v1){
        data1 <- data1 %>%
            mutate(!! v := as.integer(columnname %like% v))
    }
    

    If the vector of words is read from an Excel file (assuming the words are in the first column):

    v1 <- readxl::read_excel('file.xlsx') %>%
                      pull(1)
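
A keyword column read from a spreadsheet may contain mixed case, stray spaces, blanks, or duplicates, and the question also asks for CSV output. The sketch below shows one way to tidy the vector and write the result, under stated assumptions: raw_keywords stands in for the vector returned by pull(1) (its stray space, NA, and duplicate are hypothetical), out is a one-row stand-in for data1, and the output goes to a temporary file rather than a real path.

```r
# Stand-in for the vector returned by read_excel(...) %>% pull(1).
# The trailing space, NA, and duplicate are hypothetical, to show the cleanup.
raw_keywords <- c("fox", "Jump ", "DOG", NA, "cross", "road", "Road")

# Trim whitespace, lowercase, then drop missing/empty entries and duplicates.
clean <- tolower(trimws(raw_keywords))
clean <- unique(clean[!is.na(clean) & nzchar(clean)])

# One-row stand-in for data1, using a heading from the question.
out <- data.frame(heading = "Red fox crossed the road", stringsAsFactors = FALSE)

# Add one 0/1 column per cleaned keyword, matching case-insensitively.
for (k in clean) {
  out[[k]] <- as.integer(grepl(k, tolower(out$heading), fixed = TRUE))
}

# Write heading plus keyword columns to CSV (temporary file for illustration).
out_file <- tempfile(fileext = ".csv")
write.csv(out, out_file, row.names = FALSE)
```

Deduplicating after lowercasing matters here, because otherwise "road" and "Road" would collapse to the same column name and the second assignment would silently overwrite the first.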