R Regex for identifying UK postcodes


My question is similar to an existing one, but I'm looking for something R-specific. I have a data.frame of tens of thousands of addresses and need to pull out the postcodes. They are UK postcodes, formatted like {LETTER LETTER DIGIT, space, DIGIT LETTER LETTER}, as in the following address:

"8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ"

I've used variations of this code with stringr to no avail:

str_extract('^(\\[Gg]\\[Ii]\\[Rr] 0\\[Aa]{2})|(((\\[A-Za-z]\\[0-9]{1,2})|((\\[A-Za-z]\\[A-Ha-hJ-Yj-y]\\[0-9]{1,2})|((\\[AZa-z]\\[0-9]\\[A-Za-z])|(\\[A-Za-z]\\[A-Ha-hJ-Yj-y]\\[0-9]?\\[A-Za-z]))))\\[0-9]\\[A-Za-z]{2})$',alfa$Address)

Solution

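  • The ^ and $ anchors tie the match to the start and end of the string, but here the postcode is only one part of a longer address. Wrap the pattern in \b(?:<pattern>)\b instead to match postcodes as whole words (\b is a word boundary). The character classes are also broken because their opening [ is escaped: \[ matches a literal [ character rather than starting a class. The arguments are swapped as well: str_extract takes the input string first and the regex second. Finally, to get every match in a string rather than only the first, use str_extract_all instead of str_extract. The fixed pattern below also allows optional whitespace (\s?) between the two halves of the postcode.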

    You may fix the code like this:

    library(stringr)
    txt <- "8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ"
    pattern <- "\\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))\\s?[0-9][A-Za-z]{2}))\\b"
    str_extract_all(txt, pattern)
    # => [[1]]
    #   [1] "SY1 3GZ"
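
Since the question is about a whole data.frame column rather than a single string, here is a minimal sketch of applying the same pattern row by row. The `alfa` data.frame built here is a hypothetical stand-in for the asker's data (only the first address comes from the question; the second row is invented to show what a miss looks like), and the `Postcode` column name is an assumption.

```r
library(stringr)

# Hypothetical stand-in for the asker's `alfa` data.frame.
# Row 1 is the address from the question; row 2 is invented and has no postcode.
alfa <- data.frame(
  Address = c(
    "8, Longbow Close,\r\nHarlescott Lane,\r\nShrewsbury,\r\nEngland,\r\nSY1 3GZ",
    "No postcode in this row"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer above
pattern <- "\\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))\\s?[0-9][A-Za-z]{2}))\\b"

# str_extract() is vectorised over the column: it returns the first match
# per row, or NA where nothing matches.
alfa$Postcode <- str_extract(alfa$Address, pattern)

print(alfa$Postcode)
# [1] "SY1 3GZ" NA
```

str_extract is used here because it returns a plain character vector that fits directly into a column. If a single address may contain more than one postcode, str_extract_all would be the alternative, at the cost of getting a list with one element per row.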