I have a vector of strings of the form "letters numbers". I want to extract the numbers using the regex support in stringr::str_extract with the pattern "\\d*". The results are very confusing:
# R 4.2.3
# install.packages('stringr')
library(stringr)
# case 1
str_extract('word 42', '\\d*')
# ""
# case 2 (?)
str_extract('42 word', '\\d*')
# "42"
# case 3
str_extract('word 42', '\\d+')
# "42"
# case 4 (?!)
str_extract('word 42', '\\d*$')
# "42"
# case 5
str_extract('42 word', '\\d*$')
# ""
In all cases the expected result is "42".
I am a novice with regexes, but the pattern '\\d*' seems pretty straightforward: I understand it as "match any number of consecutive numeric characters".
The fact that it doesn't work for case 1 but does for case 2 is quite counterintuitive by itself. And then the roles seem to be reversed when using the pattern '\\d*$' (cases 4 and 5).
I have experimented with other functions (str_match and str_match_all), but the results were still not clear.
I couldn't find such a specific thing elsewhere, so I hoped more experienced R/RegEx users could provide a clarification on what is going on under the hood.
I understand it as "match any number of consecutive numeric characters".
Any number, including zero. And the engine returns the match at the first position where the pattern succeeds. Because \d* can successfully match zero digits, it succeeds immediately at the beginning of the string and never looks anywhere else. If there are no digits there, then you get "".
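To see where the match actually lands, here is a minimal sketch using base R's regexpr(), which reports the start position and length of the first match. It assumes perl = TRUE (PCRE) rather than the ICU engine that stringr uses; for a pattern this simple the two should behave the same way. The variable names are only for illustration.

```r
# regexpr() returns the start of the first match; its length is stored
# in the "match.length" attribute.
# Assumption: perl = TRUE (PCRE) stands in for stringr's ICU engine here.
m1 <- regexpr("\\d*", "word 42", perl = TRUE)
start1 <- as.integer(m1)           # where the first match begins
len1 <- attr(m1, "match.length")   # how many characters it covers

m2 <- regexpr("\\d*", "42 word", perl = TRUE)
start2 <- as.integer(m2)
len2 <- attr(m2, "match.length")

cat("word 42: start", start1, "length", len1, "\n")
cat("42 word: start", start2, "length", len2, "\n")
```

Both matches start at position 1. In 'word 42' the match has length 0 (the empty string in front of the "w"), while in '42 word' it has length 2.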
Most likely, you want \d+ instead, which matches one or more digits. Then the match will fail at positions where there aren't any digits, and you will get the first run of digits in the string.
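As a quick check of that suggestion, the sketch below applies \d+ to a small vector. The inputs beyond the two strings from the question are made up for illustration.

```r
library(stringr)

# Hypothetical sample input; only the first two strings come from the question.
x <- c("word 42", "42 word", "abc 7 def", "no digits")

# \d+ needs at least one digit, so the engine keeps scanning until it finds one.
res <- str_extract(x, "\\d+")
print(res)
```

Note that a string with no digits at all gives NA rather than "", which makes it easier to tell "no number found" apart from an empty match.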
But \d*$ works for you in case 4 because, again, the engine looks for the first position where zero or more digits are followed by the end of the string. It could match zero digits at the very end of the string, but it never gets the chance: scanning from left to right, it reaches the position right before the 42 first. In case 5 there are no digits at the end of the string, so every earlier position fails and the only place the pattern can succeed is the very end, where it matches zero digits.
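The left-to-right scan can be imitated by hand. The sketch below is only an illustration, not how stringr is implemented: the helper first_match_start() is a made-up name, and it imitates "try the pattern at position i" by anchoring the pattern with ^ against the tail of the string that starts at i.

```r
# Hypothetical helper: returns the first start position at which the
# anchored pattern matches the remainder of the string.
first_match_start <- function(s, anchored_pattern) {
  starts <- seq_len(nchar(s) + 1)   # nchar(s) + 1 is the end of the string
  tails <- substring(s, starts)     # remainder of s from each start position
  ok <- grepl(anchored_pattern, tails, perl = TRUE)
  starts[ok][1]
}

case4 <- first_match_start("word 42", "^\\d*$")
case5 <- first_match_start("42 word", "^\\d*$")

cat("case 4 first succeeds at position", case4, "\n")
cat("case 5 first succeeds at position", case5, "\n")
```

For 'word 42' the first success is at position 6, right before the 42. For '42 word' it is at position 8, which is one past the last character, so the match is empty.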