r regex string dictionary regex-lookarounds

Finding word matches between named nested list and data frame column

I have a named list of lists: each name is a job category, and each nested list holds the keywords for that category. I am trying to check a data frame column of job titles to see whether the keywords appear in each job title. The ultimate goal is to categorize each job as well as possible. I am providing a sample of the data, as there are over 15 thousand job titles and 25 different job categories to check.

This is in RStudio. I tried using lapply with str_detect; the following is the code I used.

library(stringr) 

cat.keys <- list(Internship='Intern', 
    Information.Technology=c('IT', 'Information Technology', 'Software', 'Developer'), 
    Healthcare=c('RN', 'LPN', 'Doctor', 'Nurse'), 
    Maintenance=c('Custodian', 'Janitor'))

jobs.df <- data.frame(Company=c('Big Brothers Big Sisters', 'Big Brothers Big Sisters', 
    'Big Brothers Big Sisters', 'American Red Cross', 'American Red Cross', 
    'American Red Cross', 'DeMolay International', 'Legal Aid Association', 
    'St.Mary’s Church'), 
    Job.Title = c('Intern', 'Marketing Intern', 'Special Events Internship Program', 
    'RN', 'Nurse', 'Registered Nurse', 'Director of IT - DeMolay International', 
    'SWITCHBOARD/INTAKE SPECIALIST', 'CHURCH CUSTODIAN - part-time'))
lapply(jobs.df$Job.Title, 
    function(x) sapply(cat.keys, function(y) str_detect(x, fixed(y))))

I want it to return a list of lists the length of my original cat.keys list but with TRUE/FALSE values, which is what it returns. That does most of what I want. The problem is that a shorter keyword is also found inside a longer word: 'intern' is found in 'international', so something like 'International Ambassador' gets categorized as an internship, and SWITCHBOARD returns IT.

The IT example has a second issue. I am looking for exact-case matches, so if a job title uses different capitalization, such as 'intern' instead of 'Intern', there is no match. However, if I ignore capitalization, a problem with RN arises, because a lowercase 'rn' appears in 'Intern'.

Solution

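You can take advantage of word boundaries in your regex patterns (and use regex(), not fixed()) to help your search. A word boundary, \\b, matches only at the position between a word character and a non-word character (or the start or end of the string), so "\\bIT\\b" no longer matches inside SWITCHBOARD. This should get you started--let me know if you run into more issues: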

# Adding word boundaries to each string
cat.keys2 <- lapply(cat.keys, function(x) paste0("\\b", x, "\\b"))

# Using new cat.key with regex() and ignoring case
lapply(jobs.df$Job.Title, 
       function(x) sapply(cat.keys2, function(y) str_detect(x, regex(y, ignore_case = TRUE))))
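
To see what the word boundaries change, here is a small sketch run against the kinds of titles that caused false positives in the question. The titles vector and the loose_/strict_ variable names are made up for illustration; only str_detect() and regex() come from stringr.

```r
library(stringr)

# Illustrative titles based on the problem cases in the question
titles <- c("International Ambassador",
            "SWITCHBOARD/INTAKE SPECIALIST",
            "Marketing Intern",
            "Director of IT - DeMolay International")

# Plain substring matching, ignoring case: keywords match inside longer words
loose_intern <- str_detect(titles, regex("intern", ignore_case = TRUE))
loose_it     <- str_detect(titles, regex("it", ignore_case = TRUE))

# Same keywords wrapped in \\b: only whole words match
strict_intern <- str_detect(titles, regex("\\bintern\\b", ignore_case = TRUE))
strict_it     <- str_detect(titles, regex("\\bit\\b", ignore_case = TRUE))
strict_rn     <- str_detect(titles, regex("\\brn\\b", ignore_case = TRUE))

print(data.frame(titles, loose_intern, strict_intern, loose_it, strict_it, strict_rn))
```

With the boundaries in place, "International" no longer counts as "intern", SWITCHBOARD no longer counts as "IT", and the "rn" inside "Intern" is not picked up as RN even though case is ignored.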

Also, now that you are using regex, you could change things like "\\bIntern\\b" to "\\bIntern\\b|\\bInternship\\b" (i.e., collapse your patterns into one). This matters because, with word boundaries, "Intern" on its own no longer matches "Internship". Or you could add it as a separate keyword, as you have been doing, of course. Whatever suits your needs.
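
As a rough sketch of the collapsing idea, the keywords of each category can be joined into a single alternation, which gives one TRUE/FALSE per title and category instead of one per keyword. Several things here are assumptions for illustration: the names cat.patterns, hits and first.cat, the added "Internship" keyword, the shortened job.titles sample, and the "first matching category wins" rule. It also assumes the keywords contain no regex metacharacters.

```r
library(stringr)

# Keyword lists from the question, with "Internship" added as suggested above
cat.keys <- list(Internship = c("Intern", "Internship"),
                 Information.Technology = c("IT", "Information Technology",
                                            "Software", "Developer"),
                 Healthcare = c("RN", "LPN", "Doctor", "Nurse"),
                 Maintenance = c("Custodian", "Janitor"))

# A few of the sample job titles
job.titles <- c("Marketing Intern", "Special Events Internship Program",
                "Registered Nurse", "Director of IT - DeMolay International",
                "SWITCHBOARD/INTAKE SPECIALIST", "CHURCH CUSTODIAN - part-time")

# One pattern per category, of the form \\b(?:kw1|kw2|...)\\b
cat.patterns <- vapply(cat.keys,
                       function(k) paste0("\\b(?:", paste(k, collapse = "|"), ")\\b"),
                       character(1))

# Logical matrix: one row per job title, one column per category
hits <- sapply(cat.patterns,
               function(p) str_detect(job.titles, regex(p, ignore_case = TRUE)))
rownames(hits) <- job.titles

# First matching category per title (in cat.keys order), NA when nothing matches
first.cat <- apply(hits, 1,
                   function(r) if (any(r)) names(which(r))[1] else NA_character_)
print(first.cat)
```

Taking the first match is only one way to resolve titles that hit several categories; keeping the whole hits matrix lets you apply whatever priority rule fits your data.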