Apologies in advance for the lack of reproducible code. I have a dataframe named survey. One column in the dataframe, survey$Q17, contains multiple string responses (e.g., "High costs of acquisition Lack of relevant technology"). I have been using grepl to create a new column for each of the possible responses, using variations of the grepl(needle, haystack) command.
When trying to find all instances of "High costs of implementation", I made the following counter-intuitive discoveries:
survey$Q17.hcoi <- grepl("implementation", survey$Q17)
table(survey$Q17.hcoi == "TRUE")
This returns 27 TRUE. However, the following code...
survey$Q17.hcoi <- grepl("of implementation", survey$Q17)
table(survey$Q17.hcoi == "TRUE")
...returns 26 TRUE. The following code...
survey$Q17.hcoi <- grepl("costs of implementation", survey$Q17)
table(survey$Q17.hcoi == "TRUE")
also returns 26 TRUE. Finally, the following code...
survey$Q17.hcoi <- grepl("High costs of implementation", survey$Q17)
table(survey$Q17.hcoi == "TRUE")
returns 0 TRUE.
This is perplexing. I would think that a longer search phrase in grepl (e.g., "High costs of implementation") would be superior to a shorter search phrase (e.g., "implementation"). In this case, it is not. The longest search phrase returns 0 TRUE, whereas the shortest returns 27.
Can anyone explain why this might be? I have used trimws(survey$Q17) to remove excess white space before using grepl, as I thought that might prevent some problems.
This is a simple misunderstanding of how regular expressions work. I'd suggest reading the ?regex help page.
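Without anchors such as ^ and $, a regex matches anywhere within a string; it does not have to match the entire string. The pattern 'of implementation' therefore means "match any string that contains 'of implementation'", and the string 'High costs of implementation' contains that substring. If you instead use the pattern 'High costs of implementation', you are looking for any string that contains that exact sequence of characters. This would, for example, not match the string 'of implementation', because that string does not have 'High costs ' as a prefix. A longer pattern is therefore more restrictive, not less: every added character, including its case and spacing, has to appear in the string.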
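To make this concrete, here is a minimal sketch using a made-up vector. The name responses and its contents are assumptions for illustration only; they are not taken from the survey data, and the variations shown (letter case, "cost" versus "costs") are just examples of how strings can differ, not a diagnosis of the actual data.

```r
# Hypothetical responses; not taken from the actual survey data.
responses <- c(
  "High costs of implementation",
  "high costs of implementation",   # lower-case "h"
  "High cost of implementation",    # "cost", not "costs"
  "Slow implementation",
  "Lack of relevant technology"
)

# Each pattern is matched as a substring, so every extra character
# in the pattern is one more thing the string has to contain.
n_short  <- sum(grepl("implementation", responses))
n_medium <- sum(grepl("of implementation", responses))
n_long   <- sum(grepl("costs of implementation", responses))
n_full   <- sum(grepl("High costs of implementation", responses))

c(n_short, n_medium, n_long, n_full)
```

With this toy vector the counts shrink as the pattern grows, which is the same shape as the 27, 26, 26, 0 sequence in the question.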
If what you want is to match any string that contains any of the words, you could use the regex "or" operator |:
grepl('High|cost|of|implementation', X)
replacing X with your vector. Note that the space " " is itself a character that gets matched, so ' of implementation' is not the same as 'of implementation'!
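The following sketch shows both points on made-up data and adds one way to track down the rows that the full phrase misses. The vector responses is hypothetical, and the double space in it is only one possible kind of mismatch, assumed here for illustration.

```r
# Hypothetical responses; not taken from the actual survey data.
responses <- c(
  "High costs of implementation",
  "High costs  of implementation",  # two spaces after "costs"
  "Lack of relevant technology"
)

# Alternation: TRUE when any one of the alternatives occurs anywhere.
any_word <- grepl("High|cost|of|implementation", responses)

# Spaces are literal: one space in the pattern does not match two spaces.
exact <- grepl("High costs of implementation", responses)

# " +" allows one or more spaces between the words.
flexible <- grepl("High costs +of +implementation", responses)

# Rows the short pattern finds but the full phrase misses;
# printing these shows exactly how they differ from the pattern.
missed <- responses[grepl("implementation", responses) & !exact]
print(missed)
```

Note that the alternation also flags "Lack of relevant technology", because it contains "of", so it is likely too broad for building one column per response. Inspecting missed on the real column (survey$Q17 in place of responses) should reveal what the full phrase trips over.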