Extracting strings with context in R


I'm working with a dataset containing descriptions of individuals' personal history, and I want to obtain employment data from those descriptions. In particular, I want to know the year in which they got their first job, and I know (because of the nature of the dataset) that those years are likely to be included in the personal descriptions, and almost guaranteed to be in the seventies. Id, First_name, and Description are the variables we've already got, and I want to extract First_job_year from the available data:

library(stringr)
dat <- data.frame(Id = c(1,2,3), 
           First_name = c("Adam", "Bob", "Chris"), 
           Description = c("Adam graduated high school in 1971, got married in 1973, and started working at Ford in 1975", 
           "Bob graduated from university in 1972, and a year later started working in the civil service", 
           "Chris dropped out of school in 1969 and was unemployed for a while, but found work in 1973"),
           First_job_year = c(1975, 1973, 1973))

Now, because I'm looking for a date in the seventies, I had the thought of trying to identify strings starting with "197", something like:

first_job_dates <- str_extract_all(dat$Description, "197.")
first_job_dates
[[1]]
[1] "1971" "1973" "1975"
[[2]]
[1] "1972"
[[3]]
[1] "1973"

This returns a list with one element per description. For Chris, we get the right year (1973); for Adam we get all three of 1971, 1973, and 1975 (only 1975 is correct); and for Bob we get the wrong year. I thought one way around this would be to include some context, i.e. to extract the date matching "197." along with, say, the surrounding 5 words. Then I could keep only those matches whose context includes "job" or "work"/"working", for instance, so Adam and Chris would get the correct years, and Bob might be assigned a null value (and I could go through and code these null values by hand). The problem is, I'm not sure what command to use to extract the surrounding 'context' around matches.

Is there some command or package designed for this sort of problem?


Solution

  • This will let you view the surrounding text: the pattern captures up to 15 characters on either side of each year.

    str_extract_all(dat$Description, ".{0,15}(197\\d).{0,15}")
    
    [[1]]
    [1] "high school in 1971, got married i" "n 1973, and started w"             
    [3] "ing at Ford in 1975"               
    
    [[2]]
    [1] " university in 1972, and a year la"
    
    [[3]]
    [1] " found work in 1973"
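
To go from viewing the context to actually picking a year, one possible sketch is below. It is not part of the answer above, and it rests on a few assumptions: the helper name `first_job_year`, the 30-character window, the keyword pattern `"work|job"`, and the choice to look only at the text before each year are all illustrative. A wider window is used because, in the output above, 15 characters cut "working" down to "ing" for Adam. The sketch uses `str_locate_all()` and `str_sub()` rather than one big pattern so that the context windows of neighbouring years are allowed to overlap.

```r
library(stringr)

descriptions <- c(
  "Adam graduated high school in 1971, got married in 1973, and started working at Ford in 1975",
  "Bob graduated from university in 1972, and a year later started working in the civil service",
  "Chris dropped out of school in 1969 and was unemployed for a while, but found work in 1973"
)

# Hypothetical helper: return the first 197x year whose preceding
# `window` characters mention one of the keywords, or NA if none does.
first_job_year <- function(text, window = 30, keywords = "work|job") {
  pos <- str_locate_all(text, "197\\d")[[1]]   # matrix with start/end columns
  if (nrow(pos) == 0) return(NA_integer_)

  starts <- pos[, "start"]
  years  <- str_sub(text, starts, pos[, "end"])

  # Assumption: the job is described just before the year ("... work in 1973"),
  # so only the text leading up to each year is inspected.
  context <- str_sub(text, pmax(starts - window, 1), starts - 1)

  hits <- years[str_detect(context, regex(keywords, ignore_case = TRUE))]
  if (length(hits) == 0) NA_integer_ else as.integer(hits[1])
}

result <- vapply(descriptions, first_job_year, integer(1), USE.NAMES = FALSE)
print(result)  # 1975 NA 1973
```

With these settings Adam and Chris get 1975 and 1973, and Bob gets `NA`, which leaves his entry to be coded by hand as the question suggests. The window size and keyword list would likely need tuning on real descriptions.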