r, web-scraping, text-mining

Text Mining Scraped Data (R)


I wrote the code below to look for the word "nationality" in a job postings dataset. I am trying to see how many employers specify that a candidate must be of a particular visa type or nationality.

I know that in the raw data itself (in Excel), there are several cases where the job description mentions the word "nationality".

nationality_finder = function(string){
 
  nationality = c(" ")
  split_string = strsplit(string, split = NULL)
  split_string = split_string[[1]]
  flag = 0
 
    for(letter in split_string){
      if(flag > 0){nationality = append(nationality, letter)}
      if(letter == "nationality "){flag = 1}
      if(letter == " "){flag = flag-0.5}
    }
  nationality = paste(nationality, collapse = '')
  return(nationality)
}


for(n in 1:length(df2$description)){
  df2$nationality[n] <- nationality_finder(df2$description[n])
}

df2%>%
  view()

The code runs without errors, but it does not produce what I am looking for. I want to create another variable where 1 indicates that the word "nationality" is mentioned and 0 otherwise. Specifically, I am looking for words such as "citizen" and "nationality" in the job description variable. The text of each job description is extremely long, so I have given a summarized version here for brevity.

Text example for a job description in the dataset

Title: Event Planner

Nationality: Saudi National

Location: Riyadh, Saudi Arabia

Salary: Open

Salary depends on the candidates skills, experience, and other attributes.

Another job description:

- Have recently graduated or looking for a career change and be looking for
an entry level role (we will offer full training)  

- Priority will be taken for applications by U.S. nationality holders 

Solution

  • You can try something like this. I'm assuming you have a data.frame called dats and you want to add a new column to it.

    dats$check <- as.numeric(grepl("nationality",dats$description,ignore.case=TRUE))
    dats$check
    [1] 1 1 0 1
    

    grepl() checks each element of the column dats$description for the string nationality, ignoring case (ignore.case = TRUE), and returns TRUE or FALSE. as.numeric() then converts TRUE/FALSE into 1/0.

    With fake data:

    dats <- structure(list(description = c("Title: Event Planner\n \n Nationality: Saudi National\n \n Location: Riyadh, Saudi Arabia\n \n Salary: Open\n \n Salary depends on the candidates skills, experience, and other attributes.", 
    "- Have recently graduated or looking for a career change and be looking for\n an entry level role (we will offer full training)  \n \n - Priority will be taken for applications by U.S. nationality holders ", 
    "do not have that word here", "aaaaNationalitybb"), check = c(1, 
    1, 0, 1)), row.names = c(NA, -4L), class = "data.frame")
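
The question also asks about the word "citizen", and it may help to see why the original loop never finds anything. Below is a minimal sketch of both points. The small data frame and the keywords vector are made up for illustration; swap in your own column and word list.

```r
# Hypothetical sample data standing in for the real job postings.
dats <- data.frame(
  description = c(
    "Nationality: Saudi National",
    "Open to U.S. citizens only",
    "do not have that word here"
  ),
  stringsAsFactors = FALSE
)

# Why the original loop never matches: split = NULL breaks the string into
# single characters, so no element can ever equal "nationality ".
chars <- strsplit("U.S. nationality holders", split = NULL)[[1]]
any(chars == "nationality ")   # FALSE

# Flag descriptions that mention any keyword; "|" means "or" in a regex.
keywords <- c("nationality", "citizen")   # assumed keyword list
pattern <- paste(keywords, collapse = "|")
dats$check <- as.integer(grepl(pattern, dats$description, ignore.case = TRUE))
dats$check
# [1] 1 1 0
```

Because grepl() is vectorized over the column, no for loop is needed. Note that this is a substring match, so "citizen" also matches "citizens" and "citizenship", which is probably what you want here.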