This looks like a repeat question, but the other answers haven't helped me. I'm trying to extract any 8-digit number from a text. The number could be anywhere in the text: it could stand alone, follow a string, or be followed by a string. Basically, I need to extract every occurrence of 8 consecutive numeric characters from a string in R, using regex only.
This is what I attempted but to no avail:
> my_text <- "the number 5849 and 5555555555 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't. both 12345678JE and RG10293847 should turn up as well."
> ## this doesn't work
> gsub('(\\d{8})', '\\1', my_text)
[1] "the number 5849 and 5555555555 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't. both 12345678JE and RG10293847 should turn up as well."
My desired output should extract the following numbers:
12345654
99119911
12345678
10293847
While at it, I would also be grateful if the answer included a second regex for extracting only the first occurrence of an 8-digit number:
12345654
EDIT: I have a very large table (about 200 million rows), and I need to run this on one column. What is the most efficient solution?
EDIT: I realised that my test case was missing some cases. The text also contains runs of digits that are longer than 8 digits, but I only want to extract the ones that are exactly 8 digits long.
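We can use str_extract_all from stringr. Note that \d{8} also matches the first 8 digits of any longer run, so with the 10-digit number that was later added to my_text this call additionally returns "55555555"; the output shown is for the text without it.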
stringr::str_extract_all(my_text, "\\d{8}")[[1]]
#[1] "12345654" "99119911" "12345678" "10293847"
Similarly, in base R we can use gregexpr and regmatches:
regmatches(my_text, gregexpr("\\d{8}", my_text))[[1]]
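To get the last 8-digit number, we can use sub with a greedy leading .*, which consumes as much of the string as possible before the capture group: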
sub('.*(\\d{8}).*', '\\1', my_text)
#[1] "10293847"
whereas for the first one, we can make the leading .* lazy:
sub('.*?(\\d{8}).*', '\\1', my_text)
#[1] "12345654"
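One caveat with the sub approach is that sub returns the input unchanged when the pattern does not match, so a string without any 8-digit run comes back whole. A possible alternative, sketched below, is regexpr with regmatches, which yields an empty result instead. The strings has_match and no_match and the helper first_8_digits are made up for illustration and are not part of the question.

```r
# Hypothetical example strings, shortened from the question's text
has_match <- "but12345654 and 99119911 should be"
no_match  <- "the number 5849 shouldn't turn up"

# sub() returns the input unchanged when nothing matches
unchanged <- sub(".*?(\\d{8}).*", "\\1", no_match)

# regexpr() locates only the first match; regmatches() extracts it
# (illustrative helper, not from the original answer)
first_8_digits <- function(x) {
  regmatches(x, regexpr("\\d{8}", x))
}

first_8_digits(has_match)  # "12345654"
first_8_digits(no_match)   # character(0)
```

Like the sub version, this sketch uses the plain \d{8} pattern, so it would still pick up the first 8 digits of a longer run.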
EDIT
For the updated case, where we want to match exactly 8 digits (and not more), we can use str_match_all with a negative lookbehind and a negative lookahead:
stringr::str_match_all(my_text, "(?<!\\d)\\d{8}(?!\\d)")[[1]][, 1]
#[1] "12345654" "99119911" "12345678" "10293847"
Here, we get the 8-digit numbers that are neither preceded nor followed by a digit.
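The same lookaround pattern should also work in base R, as long as perl = TRUE is passed so that the PCRE engine is used (the default engine does not support lookarounds). A minimal sketch, reusing my_text from the question; the variable names pattern, all_matches and first_match are only illustrative:

```r
my_text <- "the number 5849 and 5555555555 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't. both 12345678JE and RG10293847 should turn up as well."

# 8 digits that are neither preceded nor followed by another digit
pattern <- "(?<!\\d)\\d{8}(?!\\d)"

# All exact 8-digit runs (gregexpr finds every match)
all_matches <- regmatches(my_text, gregexpr(pattern, my_text, perl = TRUE))[[1]]
all_matches
#[1] "12345654" "99119911" "12345678" "10293847"

# Only the first exact 8-digit run (regexpr stops at the first match)
first_match <- regmatches(my_text, regexpr(pattern, my_text, perl = TRUE))
first_match
#[1] "12345654"
```

The 10-digit run 5555555555 is skipped because every 8-digit window inside it has a digit on at least one side.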
A simpler option is to extract all the numbers from the string and keep only those that are exactly 8 digits long:
v1 <- regmatches(my_text, gregexpr("\\d+", my_text))[[1]]
v1[nchar(v1) == 8]
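For the large-table case, the functions above are vectorised, so the first exact 8-digit number can be pulled from a whole column in one call rather than row by row. The sketch below is one way to do that in base R, returning NA where a row has no match; extract_first_8 and the sample vector x are hypothetical names, and no claim is made that this is the fastest option for 200 million rows, so it is worth benchmarking on a sample first.

```r
# Illustrative helper: first exact 8-digit run per element, NA if none
extract_first_8 <- function(x) {
  pattern <- "(?<!\\d)\\d{8}(?!\\d)"
  m <- regexpr(pattern, x, perl = TRUE)   # -1 for no match, NA for NA input
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m > 0
  out[hit] <- regmatches(x, m)            # regmatches() keeps only the hits
  out
}

# Hypothetical column values
x <- c("but12345654 and 99119911", "5555555555", "RG10293847", NA)
res <- extract_first_8(x)
res
#[1] "12345654" NA         "10293847" NA
```

With stringr, str_extract(x, pattern) behaves similarly, returning NA for elements without a match.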