I have a data frame with factor columns. I could not find exact matches on the factor level names using regex.
df <- structure(list(age = structure(1:2, .Label = c("18-25",
">25"), class = "factor"), `M` = c("13.4",
"12.8"), 'N' = c("73", "76"), `SD` = c("6.8",
"6.6")), row.names = 51:52, class = "data.frame")
My df
age M N SD
51 18-25 13.4 73 6.8
52 >25 12.8 76 6.6
First try:
regexpr(pattern = "18-25", text= df, ignore.case = FALSE, perl = FALSE, fixed = T)
[1] -1 -1 -1 -1
attr(,"match.length")
[1] -1 -1 -1 -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
Second Try
saved_level_name <- structure(list(V1 = structure(1L, .Label = "18-25", class = "factor")), row.names = c(NA,
-1L), class = "data.frame")
regexpr(pattern = saved_level_name, text= df, ignore.case = FALSE, perl = FALSE, fixed = T)
[1] 1 4 -1 -1
attr(,"match.length")
[1] 1 1 -1 -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
Third Try (compare two outputs!)
saved_name_level_2 <- structure(list(V4 = structure(1L, .Label = ">25", class = "factor")), row.names = c(NA,
-1L), class = "data.frame")
regexpr(pattern = saved_level_name, text= df[1], ignore.case = FALSE, perl = FALSE, fixed = T)
regexpr(pattern = saved_name_level_2, text= df[1], ignore.case = FALSE, perl = FALSE, fixed = T)
[1] 1
attr(,"match.length")
[1] 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[1] 1
attr(,"match.length")
[1] 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
Fourth try
regexpr(pattern = as.character( saved_name_level ), text= df, ignore.case = FALSE, perl = FALSE, fixed = T)
[1] -1 -1 -1 -1
attr(,"match.length")
[1] -1 -1 -1 -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
First try: no matches.
Second try: the results (1, 4) make no sense to me.
Third try: identical results for two different inputs.
Fourth try: no matches.
Possibly regex is matching against the stored integer codes of the factors rather than their level names?
How can I use regex to search the factor level names, and not the underlying values?
The reason this is failing can be found with debug:
debugonce(regexpr)
regexpr(pattern = "18-25", text= df, ignore.case = FALSE, perl = FALSE, fixed = T)
# debugging in: regexpr(pattern = "18-25", text = df, ignore.case = FALSE, perl = FALSE,
# fixed = T)
# debug: {
# if (!is.character(text))
# text <- as.character(text)
# .Internal(regexpr(as.character(pattern), text, ignore.case,
# perl, fixed, useBytes))
# }
# debug: if (!is.character(text)) text <- as.character(text)
# debug: text <- as.character(text)
OK, so let R run that as.character command, which converts the "text" (really a data frame) into a character version of it.
text
# [1] "1:2" "c(\"13.4\", \"12.8\")" "c(\"73\", \"76\")"
# [4] "c(\"6.8\", \"6.6\")"
That last part is the clincher. When regexpr converts your text argument (which is really intended to be a character vector), it turns the factor df$age into a character representation of the factor's integer codes, "1:2", rather than its level labels. (The fact that it generates a :-sequence is interesting to me, but that's a different point.)
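To see the difference at the column level, here is a minimal sketch. It rebuilds the frame from the question with data.frame() instead of structure(), which is my own shortcut; the names codes, labels and hit are only for illustration.

```r
# Rebuild the question's data frame (same values, built with data.frame()).
df <- data.frame(
  age = factor(c("18-25", ">25"), levels = c("18-25", ">25")),
  M   = c("13.4", "12.8"),
  N   = c("73", "76"),
  SD  = c("6.8", "6.6"),
  row.names = 51:52,
  stringsAsFactors = FALSE
)

# A factor stores integer codes plus a table of level labels.
codes  <- as.integer(df$age)    # the underlying codes
labels <- as.character(df$age)  # the level labels, one per row

# Applied to the single column, the search runs against the labels.
hit <- regexpr("18-25", labels, fixed = TRUE)
print(hit)
```

Applied to one factor column, as.character() returns the labels, so the first element matches at position 1 and the second does not match.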
Obviously "1:2" is not going to match your "18-25" test. You really should be checking individual vectors/columns. If you have several, then perhaps
lapply(df, function(v) regexpr(pattern = "18-25", text=v, ignore.case = FALSE, perl = FALSE, fixed = T))
with df replaced by df[,1:3], df[,-5], or whatever you can use to delineate which columns to include. But checking a whole frame at once with factors will not work.
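If the goal is to search only the factor columns, one possible way to pick them is by type rather than by position. This is a sketch, not the only approach; is_fac and hits are illustrative names, and the frame is rebuilt here so the snippet runs on its own.

```r
# Rebuild the question's data frame (same values, built with data.frame()).
df <- data.frame(
  age = factor(c("18-25", ">25"), levels = c("18-25", ">25")),
  M   = c("13.4", "12.8"),
  N   = c("73", "76"),
  SD  = c("6.8", "6.6"),
  row.names = 51:52,
  stringsAsFactors = FALSE
)

# Logical vector flagging which columns are factors.
is_fac <- vapply(df, is.factor, logical(1))

# Run regexpr column by column, converting each factor to its labels first.
hits <- lapply(df[is_fac], function(v) {
  regexpr("18-25", as.character(v), fixed = TRUE)
})
print(hits)
```

Only age is a factor in this frame, so hits has a single element named age.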
If all you want to do is find instances in the factors where the pattern matches (instead of extracting or replacing it), then perhaps grepl is more suited:
lapply(df, grepl, pattern = "18-25", fixed = TRUE)
# $age
# [1] TRUE FALSE
# $M
# [1] FALSE FALSE
# $N
# [1] FALSE FALSE
# $SD
# [1] FALSE FALSE
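Note that a fixed pattern still matches substrings. If an exact match on the level name is what is wanted, plain comparison may be enough and no regex is needed. The sketch below assumes that reading of the question; rows, matched and level_pos are illustrative names.

```r
# Rebuild the question's data frame (same values, built with data.frame()).
df <- data.frame(
  age = factor(c("18-25", ">25"), levels = c("18-25", ">25")),
  M   = c("13.4", "12.8"),
  N   = c("73", "76"),
  SD  = c("6.8", "6.6"),
  row.names = 51:52,
  stringsAsFactors = FALSE
)

# Comparing a factor with a string compares against the level labels.
rows <- df$age == "18-25"

# Use the logical vector to keep only the matching rows.
matched <- df[rows, ]

# Position of the label within the factor's levels (NA if it is not a level).
level_pos <- match("18-25", levels(df$age))

print(matched)
print(level_pos)
```

Here matched keeps only row 51, and level_pos is the index of "18-25" in levels(df$age).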