My question is similar to this, but I'm looking for something R specific. I've got a data.frame of tens of thousands of addresses and need to pull out the postcodes. Postcodes are in the UK and formatted {LETTER_LETTER_DIGIT DIGIT_LETTER_LETTER}, similar to the following:
"8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ"
I've used variations of this code with stringr to no avail:
str_extract('^(\\[Gg]\\[Ii]\\[Rr] 0\\[Aa]{2})|(((\\[A-Za-z]\\[0-9]{1,2})|((\\[A-Za-z]\\[A-Ha-hJ-Yj-y]\\[0-9]{1,2})|((\\[AZa-z]\\[0-9]\\[A-Za-z])|(\\[A-Za-z]\\[A-Ha-hJ-Yj-y]\\[0-9]?\\[A-Za-z]))))\\[0-9]\\[A-Za-z]{2})$',alfa$Address)
The ^ and $ anchors require the pattern to match the whole string, so a postcode that sits at the end of a longer address is never found. You may wrap the pattern with \b(?:<pattern>)\b to match those codes as whole words (\b is a word boundary). Besides, the character classes are "ruined" since you escaped their [ starting bracket (\[ matches a literal [ char). Also, swap the arguments: the first one is the input, the second one is the regex. Finally, to get all matches, you need to use str_extract_all rather than str_extract.
You may fix the code like this:
library(stringr)
txt <- "8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ"
pattern <- "\\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))\\s?[0-9][A-Za-z]{2}))\\b"
str_extract_all(txt, pattern)
# => [[1]]
# [1] "SY1 3GZ"
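
Since the question is about a data.frame column, here is a minimal sketch of applying the same pattern to a whole column. The alfa data frame below is a hypothetical two-row stand-in for the real data, and the Postcode column name is an assumption. It uses str_extract rather than str_extract_all on the assumption that each address holds at most one postcode, which gives one value per row (NA where nothing matches):

```r
library(stringr)

# Hypothetical stand-in for the asker's data frame (alfa$Address in the question).
alfa <- data.frame(
  Address = c(
    "8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ",
    "No postcode on this line"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the fix above.
pattern <- "\\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))\\s?[0-9][A-Za-z]{2}))\\b"

# str_extract() is vectorised: it returns the first match for each element,
# or NA when an element has no match, so the result lines up with the rows.
alfa$Postcode <- str_extract(alfa$Address, pattern)

alfa$Postcode
# [1] "SY1 3GZ" NA
```

If an address may contain several postcodes, str_extract_all(alfa$Address, pattern) returns a list with one character vector per row instead.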