How to use regex with kwic to get all matches?


I can't seem to get the desired output from quanteda's kwic. Here's what I've tried:

library(quanteda)
library(tidyverse)

Given this text

text <- "This is a phone number: 222-222-2222. Here's another phone number...(111)111 1111. This -- 333-3333 -- aint a complete phone number."

Here's a regex that matches most US phone numbers, along with one non-digit character on each side of the number:

regex.phone1 <- "\\D\\(?\\d{3}\\)?[.\\s-]?\\s*\\d{3}[.\\s-]?\\s*[.\\s-]*\\d{4}\\D"

It matches the first number here, which means the regex is working as expected.

regmatches(text,regexpr(regex.phone1,text))

" 222-222-2222." 

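Note that regexpr() only ever reports the first match, so the call above cannot show whether the second number is found as well. A minimal base-R sketch of collecting every match with gregexpr() follows. It assumes PCRE (perl = TRUE) so that \s inside the bracket expressions is read as a whitespace class, and all_matches is a name introduced here purely for illustration.

```r
text <- "This is a phone number: 222-222-2222. Here's another phone number...(111)111 1111. This -- 333-3333 -- aint a complete phone number."
regex.phone1 <- "\\D\\(?\\d{3}\\)?[.\\s-]?\\s*\\d{3}[.\\s-]?\\s*[.\\s-]*\\d{4}\\D"

# gregexpr() locates every non-overlapping match; regexpr() stops after the first one
all_matches <- regmatches(text, gregexpr(regex.phone1, text, perl = TRUE))[[1]]

print(all_matches)
```

Both complete numbers should come back, each with its flanking non-digit character, while the seven-digit 333-3333 is skipped.
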
But kwic doesn't match anything. This:

 kwic(
  x = text,
  pattern = regex.phone1,
  window = 5,
  valuetype = "regex",
  case_insensitive = TRUE
) %>%
  as_tibble

returns:

A tibble: 0 x 7
… with 7 variables: docname <chr>, from <int>, to <int>, pre <chr>, keyword <chr>,
  post <chr>, pattern <fct>

I want it to match all phone numbers, which in this case are:

"222-222-2222."

".(111)111 1111."

(and put those in the normal form of the kwic output that displays pre, post, and more).


Solution

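  • kwic() matches a pattern against individual tokens rather than the raw string, so a regex that spans several tokens, or that includes the characters around the number, finds nothing. I matched the phone numbers by building phrases from regular expressions, with one expression per token: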

    library(quanteda)
    library(tidyverse)
    
    text <- "This is a number: 541 145-8884 also 222-222-2222 Here's (444)111-1111. No. 555 666 7774"
    
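    kwic(
      # kwic() searches tokens; newer quanteda releases expect a tokens object here
      x = tokens(text),
      # each space-separated piece inside phrase() is matched against one token
      phrase(c("^[\\d]{10}$","^[\\d]{3} [\\d]{3}-[\\d]{4}$","^[\\d]{3}-[\\d]{3}-[\\d]{4}$","^[\\d]{3} [\\d]{3} [\\d]{4}$","^[(] [\\d]{3} [)] [\\d]{3}-[\\d]{4}$")),
      window = 3,
      valuetype = "regex",
      separator = " ",
      case_insensitive = FALSE
    ) %>%
      print()  # prints the kwic object; use as_tibble() instead for a tibble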
    
    # Output:                                                                                                 
    #   [text1, 6:7]                a number: |   541 145-8884   | also 222-222-2222 Here's
    #   [text1, 9:9]        541 145-8884 also |   222-222-2222   | Here's( 444             
    # [text1, 11:14] also 222-222-2222 Here's | ( 444 ) 111-1111 | . No.                   
    # [text1, 18:20]                    . No. |   555 666 7774   |
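
To see why the single regex from the question returns nothing, it can help to apply it piece by piece. The sketch below is only an approximation: it splits on whitespace with base R's strsplit() as a crude stand-in for quanteda's tokenizer (which splits further, for example around parentheses), and pieces and hits are names made up for the illustration.

```r
text <- "This is a phone number: 222-222-2222. Here's another phone number...(111)111 1111. This -- 333-3333 -- aint a complete phone number."
regex.phone1 <- "\\D\\(?\\d{3}\\)?[.\\s-]?\\s*\\d{3}[.\\s-]?\\s*[.\\s-]*\\d{4}\\D"

# Crude stand-in for tokenization (assumption: the real tokenizer splits at least this finely)
pieces <- strsplit(text, "\\s+")[[1]]

# Test the regex against each piece on its own, as a per-token matcher would
hits <- grepl(regex.phone1, pieces, perl = TRUE)

print(pieces[hits])
print(any(hits))
```

No single piece carries both the full number and the non-digit characters around it, which is consistent with the approach above: drop the surrounding \D and describe each token with its own pattern anchored by ^ and $.
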