r, string-matching, grepl

R: partial string match that returns a value from the matched row (like "match" in Excel)


Is there a function in R similar to "match" in Excel?

For example if I have a dataset with people's educational degrees:

> edu
chr [1:4] "Bachelor" "NA" "Master" "Superieur" 

And an international mapping system by ISCED:

> ISCED
 Main education program                      English translation                   Code
 Brevet d'enseignement supérieur (BES)       certificate of higher education        5
 bachelier de transition                     Bachelor                               6
 Bachelor                                    Bachelor                               6
 Master                                      Master                                 7       
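
For reproducibility, the two objects above could be rebuilt roughly as follows. This is only a sketch: the column names Main_education_program and Code are taken from the code in the answer, while English_translation is an assumed name for the middle column.

```r
# Degrees as entered; "NA" is a literal string here, as in the str() output above
edu <- c("Bachelor", "NA", "Master", "Superieur")

# ISCED mapping table; English_translation is an assumed column name
ISCED <- data.frame(
  Main_education_program = c("Brevet d'enseignement supérieur (BES)",
                             "bachelier de transition",
                             "Bachelor",
                             "Master"),
  English_translation = c("certificate of higher education",
                          "Bachelor", "Bachelor", "Master"),
  Code = c(5, 6, 6, 7),
  stringsAsFactors = FALSE
)

str(edu)
```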

I wonder if there is a function that can partially match each string in the vector edu against the first column of the data frame ISCED and, where there is a match, return the corresponding code (5, 6 or 7).

I know there are functions like "%like%" or "grepl", but I am looking for something that can skim through all values of the vector edu and not just one particular string defined each time.

Does anybody have any insights? Or would you suggest using a loop with "grepl"?

Thank you!


Solution

  • One way is to use grep.

    Collapse edu into a single regular-expression pattern with paste0(edu, collapse = "|"), use grep to get the indices of the rows whose first column (Main_education_program) matches that pattern, and use those indices to fetch the corresponding Code from the data frame.

    ISCED$Code[grep(paste0(edu, collapse = "|"), ISCED$Main_education_program)]
    
    #[1] 6 7
    

    EDIT

    To get the updated output requested by the OP, we can use sapply to loop over every element of edu and check whether it is present in Main_education_program, returning the matching Code if it is and NA otherwise.

    sapply(edu, function(x) if(length(grep(x, ISCED$Main_education_program)) > 0) 
                             ISCED$Code[grep(x, ISCED$Main_education_program)] else NA)
    

    which returns

    #  Bachelor        NA    Master  Superieur 
    #        6         NA         7        NA 
    

    If we need the result without the names, we can wrap it in unname:

    unname(sapply(edu, function(x) if(length(grep(x, ISCED$Main_education_program))>0) 
                      ISCED$Code[grep(x, ISCED$Main_education_program)] else NA ))
    
    #[1]  6 NA  7 NA
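
One caveat with the sapply version: if an element of edu matches more than one row, sapply returns a list instead of a vector, and any regex metacharacters in edu are interpreted as patterns. Below is a minimal sketch of a stricter variant, assuming that the first matching row should win and that matches should be literal substrings. lookup_code is a hypothetical helper name, not a base R function, and the data is reconstructed from the question.

```r
# Example data reconstructed from the question
edu <- c("Bachelor", "NA", "Master", "Superieur")
ISCED <- data.frame(
  Main_education_program = c("Brevet d'enseignement supérieur (BES)",
                             "bachelier de transition", "Bachelor", "Master"),
  Code = c(5, 6, 6, 7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: Code of the first row whose program name contains x
lookup_code <- function(x, programs, codes) {
  hit <- grep(x, programs, fixed = TRUE)  # literal substring match, no regex
  if (length(hit) > 0) codes[hit[1]] else NA_real_
}

# vapply guarantees a plain numeric vector, one value per element of edu
codes <- vapply(edu, lookup_code, numeric(1),
                programs = ISCED$Main_education_program,
                codes = ISCED$Code,
                USE.NAMES = FALSE)
codes
#[1]  6 NA  7 NA
```

Matching stays case-sensitive here, so "Superieur" still does not match "supérieur"; handling case or accents would need an extra normalisation step.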