Find regex matches in the names of factor levels in a df in R


I have a data frame that contains factor columns, and each factor has several levels. I could not find exact matches on the level names (labels) using regex.

  df <- structure(list(age = structure(1:2, .Label = c("18-25", 
                   ">25"), class = "factor"), `M` = c("13.4", 
                   "12.8"), 'N' = c("73", "76"), `SD` = c("6.8", 
                    "6.6")), row.names = 51:52, class = "data.frame")

My df

     age    M  N  SD
51 18-25 13.4 73 6.8
52   >25 12.8 76 6.6




First try: 

         regexpr(pattern = "18-25", text= df, ignore.case = FALSE, perl = FALSE,  fixed = T)


    [1] -1 -1 -1 -1
    attr(,"match.length")
    [1] -1 -1 -1 -1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

Second Try

     saved_level_name <- structure(list(V1 = structure(1L, .Label = "18-25", class = "factor")), row.names = c(NA, 
     -1L), class = "data.frame") 
     regexpr(pattern = saved_level_name, text= df, ignore.case = FALSE, perl = FALSE,  fixed = T)


    [1]  1  4 -1 -1
    attr(,"match.length")
    [1]  1  1 -1 -1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

Third Try (compare two outputs!)

     saved_name_level_2 <- structure(list(V4 = structure(1L, .Label = ">25", class = "factor")), row.names = c(NA, 
     -1L), class = "data.frame")

     regexpr(pattern = saved_level_name, text= df[1], ignore.case = FALSE, perl = FALSE,  fixed = T)

     regexpr(pattern = saved_name_level_2, text= df[1], ignore.case = FALSE, perl = FALSE,  fixed = T)



    [1] 1
    attr(,"match.length")
    [1] 1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

    [1] 1
    attr(,"match.length")
    [1] 1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

Fourth Try

     regexpr(pattern = as.character( saved_name_level ), text= df, ignore.case = FALSE, perl = FALSE,  fixed = T)

    [1] -1 -1 -1 -1
    attr(,"match.length")
    [1] -1 -1 -1 -1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

First try: 0 results.
Second try: the results (1, 4?) make no sense to me.
Third try: the same results for two patterns that look different at face value.
Fourth try: no results!

Possibly, regex finds the stored value of factors and not their face value/name?

How can I use regex to search the factor level names, and not their stored values?


Solution

  • The reason this is failing can be found with debug:

    debugonce(regexpr)
    regexpr(pattern = "18-25", text= df, ignore.case = FALSE, perl = FALSE,  fixed = T)
    # debugging in: regexpr(pattern = "18-25", text = df, ignore.case = FALSE, perl = FALSE, 
    #     fixed = T)
    # debug: {
    #     if (!is.character(text)) 
    #         text <- as.character(text)
    #     .Internal(regexpr(as.character(pattern), text, ignore.case, 
    #         perl, fixed, useBytes))
    # }
    # debug: if (!is.character(text)) text <- as.character(text)
    # debug: text <- as.character(text)
    

    OK, so step forward and let R run that as.character command, which converts text (really a data frame, not a character vector) into a character version of it:

    text
    # [1] "1:2"                   "c(\"13.4\", \"12.8\")" "c(\"73\", \"76\")"    
    # [4] "c(\"6.8\", \"6.6\")"  
    

    That last part is the clincher. regexpr expects its text argument to be a character vector, so it coerces the whole data frame with as.character, and the factor column df$age comes out as a deparsed representation of its integer codes, "1:2", instead of its labels. (The fact that it generates a :-sequence is interesting to me ... but that's a different point.)
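
The coercion can be reproduced on its own. The sketch below uses an illustrative factor `f`, not the original data. It also seems to account for the 1 and 4 in the second try: a one-cell data frame passed as pattern goes through as.character as well, so it ends up as the code "1" rather than "18-25". The exact deparsed strings may differ between R versions.

```r
# A factor stores integer codes plus a vector of labels (its levels).
f <- factor(c("18-25", ">25"), levels = c("18-25", ">25"))

codes  <- as.integer(f)     # the stored codes: 1 2
labels <- as.character(f)   # the labels: "18-25" ">25"

# Wrapped in a list (a data frame is a list of columns), the labels are lost,
# because as.character() on a list works from the underlying codes.
as.character(list(f[1]))

# A pattern of "1" matches "1:2" at position 1 and the deparsed M column
# at position 4 (after the characters c, ( and ").
m <- regexpr("1", c("1:2", "c(\"13.4\", \"12.8\")"), fixed = TRUE)
as.integer(m)               # 1 4
```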

    Obviously "1:2" is not going to match your "18-25" test. You really should be checking individual vectors/columns. If you have multiples, then perhaps

    lapply(df, function(v) regexpr(pattern = "18-25", text=v, ignore.case = FALSE, perl = FALSE,  fixed = T))
    

    or replace df in that call with df[,1:3], df[,-5], or whatever subset delineates the columns you want to search. But passing a whole frame that contains factors to regexpr in one go will not work.
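
One possible way to pick the columns programmatically, sketched on a simplified stand-in for the data (here N is numeric, unlike in the question, purely so that the filter has something to drop). The names `demo`, `is_text` and `hits` are illustrative.

```r
# Simplified stand-in for the question's data frame.
demo <- data.frame(
  age = factor(c("18-25", ">25"), levels = c("18-25", ">25")),
  M   = c("13.4", "12.8"),
  N   = c(73, 76),
  stringsAsFactors = FALSE
)

# Keep only the factor or character columns.
is_text <- vapply(demo, function(v) is.factor(v) || is.character(v), logical(1))

# regexpr() now sees one column at a time, so the factor column is
# converted to its labels rather than to its codes.
hits <- lapply(demo[is_text], regexpr, pattern = "18-25", fixed = TRUE)
as.integer(hits$age)   # 1 -1: match at position 1 in "18-25", none in ">25"
```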

    If all you want to do is find which entries of each column match the pattern (instead of extracting or replacing the match), then perhaps grepl is better suited:

    lapply(df, grepl, pattern = "18-25", fixed = TRUE)
    # $age
    # [1]  TRUE FALSE
    # $M
    # [1] FALSE FALSE
    # $N
    # [1] FALSE FALSE
    # $SD
    # [1] FALSE FALSE
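
If the goal is to search the level names themselves rather than the rows, one option is to work on levels() directly. This is only a sketch on an illustrative one-column frame `demo`; for an exact, whole-string match a regular expression may not be needed at all.

```r
demo <- data.frame(age = factor(c("18-25", ">25"), levels = c("18-25", ">25")))

# Search the level names, independent of how many rows use them.
lv <- levels(demo$age)
found <- lv[grepl("18-25", lv, fixed = TRUE)]   # "18-25"

# For an exact match, plain comparison is enough.
rows <- which(demo$age == "18-25")              # 1
"18-25" %in% lv                                 # TRUE

# With a real regular expression, anchor it so that longer strings
# containing the pattern (e.g. "118-250") are rejected.
anchored <- grepl("^18-25$", c(lv, "118-250"))  # TRUE FALSE FALSE
```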