Tags: r, regex, text, data-cleaning

Replacing phone numbers in different formats in R


I am using a regex suggested here to replace phone numbers in any format with aaaaaaaaaa. This is a snapshot of my data:

library(dplyr)    # provides %>%, select() and mutate()
library(stringr)  # provides str_replace_all()

df <- data.frame(
  text = c(
    'my number is (123)-416-567',
    "1 321 124 7889 is valid",
    'why not taking 987-012-6782',
    '120 967 3256 is correct',
    'call at 888 969 9919',
    'please text at 1 647 989 1213'
  )
)

df %>% select(text)

                           text
1    my number is (123)-416-567
2       1 321 124 7889 is valid
3   why not taking 987-012-6782
4       120 967 3256 is correct
5          call at 888 969 9919
6 please text at 1 647 989 1213

My code is

df %>% 
  mutate(
    text = str_replace_all(text, '^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$', 'aaaaaaaaaa')
  )

and I get this error

Error: '\+' is an unrecognized escape in character string starting "'^(\+"
Error: unexpected ')' in "  )"

The outcome should be like

                       text
1   my number is aaaaaaaaaa
2       aaaaaaaaaa is valid
3 why not taking aaaaaaaaaa
4     aaaaaaaaaa is correct
5        call at aaaaaaaaaa
6 please text at aaaaaaaaaa

Solution

  • You can use

    str_replace_all(text, '(?:\\+?\\d{1,2}\\s)?\\(?\\d{3}\\)?[\\s.-]\\d{3}[\\s.-]\\d{3,4}(?!\\d)', 'aaaaaaaaaa')
    

    See the regex demo.

    Details:

    • (?:\+?\d{1,2}\s)? - an optional sequence of an optional + and then one or two digits and a whitespace
    • \(? - an optional (
    • \d{3} - three digits
    • \)? - an optional )
    • [\s.-] - a -, . or whitespace
    • \d{3} - three digits
    • [\s.-] - a -, . or whitespace
    • \d{3,4} - three or four digits
    • (?!\d) - a negative lookahead: no digit is allowed immediately after the match.
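
To see the pattern described above in context, here is a minimal sketch that applies it to the sample data from the question. It assumes the dplyr and stringr packages are installed; the names phone_pattern and result are only illustrative and do not come from the original post.

```r
library(dplyr)
library(stringr)

# Sample data copied from the question
df <- data.frame(
  text = c(
    "my number is (123)-416-567",
    "1 321 124 7889 is valid",
    "why not taking 987-012-6782",
    "120 967 3256 is correct",
    "call at 888 969 9919",
    "please text at 1 647 989 1213"
  ),
  stringsAsFactors = FALSE
)

# Pattern from the answer; every regex backslash is doubled
# because it is written inside an R string literal.
phone_pattern <- "(?:\\+?\\d{1,2}\\s)?\\(?\\d{3}\\)?[\\s.-]\\d{3}[\\s.-]\\d{3,4}(?!\\d)"

# Replace every phone-like substring with the placeholder
result <- df %>%
  mutate(text = str_replace_all(text, phone_pattern, "aaaaaaaaaa"))

print(result)
```

Storing the pattern in a variable is just a readability choice; passing the string directly to str_replace_all(), as in the answer, should behave the same way.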

    Notes:

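    • In an R string literal, a regex backslash must be written as a double backslash (\\). A lone \+ is not a valid string escape, which is what caused the error.
    • ^ and $ match the start and end of the string, so the original pattern could only match when the whole string was a phone number. Since the numbers appear inside longer text, it makes sense to remove the ^ anchor and replace $ with a right-hand digit boundary, (?!\d).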
    • The final \d{4} in the original pattern did not match numbers whose last part contains only three digits, such as (123)-416-567, so I replaced it with \d{3,4}.
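
The notes above can be checked with a short sketch. It assumes R 4.0.0 or later for the raw string syntax r"(...)", which is an alternative to doubling backslashes and is not part of the original answer; the variable names are illustrative.

```r
library(stringr)

# Ordinary string literal: each regex backslash is doubled
escaped <- "(?:\\+?\\d{1,2}\\s)?\\(?\\d{3}\\)?[\\s.-]\\d{3}[\\s.-]\\d{3,4}(?!\\d)"

# Raw string literal (R >= 4.0.0): backslashes are written once
raw <- r"((?:\+?\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{3,4}(?!\d))"

# Both literals produce the same pattern string
same_pattern <- identical(escaped, raw)

# With ^ and $ the pattern only matches when the whole string is a number
anchored <- paste0("^", raw, "$")
x <- c("call at 888 969 9919", "888 969 9919")
anchored_hits <- str_detect(x, anchored)   # FALSE TRUE
unanchored_hits <- str_detect(x, raw)      # TRUE TRUE

# Requiring exactly four final digits misses the three-digit ending
strict <- r"(\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}(?!\d))"
strict_hit <- str_detect("my number is (123)-416-567", strict)  # FALSE
loose_hit <- str_detect("my number is (123)-416-567", raw)      # TRUE
```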