Tags: r, tidyverse, text-mining

Generating a dummy variable using grepl()


I wrote the following, and it runs without errors.

df2$qualifications <- as.numeric(grepl("high school|Bachelor|master|phd",df2$description,ignore.case=TRUE))
df2$qualifications

This is the output, which shows 1 if any of the words above is mentioned and 0 otherwise.

  [1] 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 1 0
 [51] 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 1 1 0 0 1 0 0 1 0 1 1 0 0 0 0 1
[101] 0 1 0 0

The dataset contains job postings, each with a description of the educational qualifications the employer is looking for. I am interested in assigning a dummy variable for each educational level mentioned in a job's description.

Specifically, I am looking for output like the one below, where 0 means no qualification is mentioned, 1 = high school, 2 = bachelor, 3 = master's, and 4 = PhD:

[1] 0 2 4 1 3 1 0 1 0 1 1 1 2 1 0 1

Solution

  • Using for-loops:

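    #placing the possible levels into a vector, ordered from lowest to highest
    #(defined first, because the example data below is sampled from it)
    educ = c("high school", "Bachelor", "master", "phd")
    
    #example data: 100 descriptions sampled from the levels above
    df2 = data.frame(description = sample(educ, 100, TRUE))
    df2$qualifications = 0 #0 means no qualification is mentioned
    
    #for each value in educ, rows whose description mentions it get that level's position (1 to 4);
    #if a description mentions several levels, the highest one wins because it is assigned last
    for(i in seq_along(educ)){
      value = grepl(educ[i], df2$description, ignore.case=TRUE)
      df2$qualifications[value] = i
    }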
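
The same idea can also be written without an explicit loop. The following is a minimal sketch, not the only way to do it: the `descriptions` vector is made-up example data standing in for `df2$description`, and `hits` and `qualifications` are illustrative names. It builds one logical column per level with `grepl()`, multiplies each column by its level number, and keeps the highest level found in each row, so rows with no match stay at 0.

```r
# Hypothetical example data; in practice this would be df2$description
descriptions <- c("Requires a high school diploma",
                  "Bachelor or Master degree preferred",
                  "PhD in statistics",
                  "No formal requirements")

# Levels ordered from lowest (1) to highest (4)
educ <- c("high school", "bachelor", "master", "phd")

# One logical column per level: TRUE where the description mentions it
hits <- sapply(educ, grepl, x = descriptions, ignore.case = TRUE)

# Scale each column by its level number, then take the row-wise maximum;
# a row with no TRUE values has a maximum of 0
qualifications <- apply(sweep(hits, 2, seq_along(educ), `*`), 1, max)

print(qualifications)
# [1] 1 3 4 0
```

As with the loop, a description that mentions more than one level is coded with the highest one. Note that `sapply()` only returns a matrix here because there is more than one description; a single-row input would need extra handling.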
    

    As you're already creating a categorical variable, I'd recommend using the