I can't seem to get the desired output using quanteda's kwic. Here's what I've tried:
library(quanteda)
library(tidyverse)
Given this text:
text <- "This is a phone number: 222-222-2222. Here's another phone number...(111)111 1111. This -- 333-3333 -- aint a complete phone number."
Here's a regex that matches most US phone numbers, along with one non-digit character on each side of the number:
regex.phone1 <- "\\D\\(?\\d{3}\\)?[.\\s-]?\\s*\\d{3}[.\\s-]?\\s*[.\\s-]*\\d{4}\\D"
It matches the first number here, which means the regex is working as expected.
regmatches(text,regexpr(regex.phone1,text))
" 222-222-2222."
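Note that regexpr() only reports the first match. As a minimal sketch for checking every match the pattern finds in the raw string, gregexpr() can be used instead. The variable name all_matches and the perl = TRUE argument are choices made for this illustration, not part of the original attempt.

```r
# The sample text and pattern from above, repeated so this runs on its own.
text <- "This is a phone number: 222-222-2222. Here's another phone number...(111)111 1111. This -- 333-3333 -- aint a complete phone number."
regex.phone1 <- "\\D\\(?\\d{3}\\)?[.\\s-]?\\s*\\d{3}[.\\s-]?\\s*[.\\s-]*\\d{4}\\D"

# gregexpr() locates every non-overlapping match; regexpr() stops at the first.
# perl = TRUE is an assumption of this sketch, so that \\d, \\s and \\D
# are handled by the PCRE engine.
all_matches <- regmatches(text, gregexpr(regex.phone1, text, perl = TRUE))[[1]]
print(all_matches)
```

This confirms that the pattern itself finds both complete numbers when applied to the whole string, so the empty kwic result is not caused by the regex failing on the second number.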
But kwic doesn't match anything. This:
kwic(
x = text,
pattern = regex.phone1,
window = 5,
valuetype = "regex",
case_insensitive = TRUE
) %>%
as_tibble
returns:
A tibble: 0 x 7
… with 7 variables: docname <chr>, from <int>, to <int>, pre <chr>, keyword <chr>,
post <chr>, pattern <fct>
I would like it to match all complete phone numbers, which in this case are:
"222-222-2222."
".(111)111 1111."
(and put those in the normal form of the kwic output that displays pre, post, and more).
I've tried to match the phone numbers by making phrases from regular expressions.
library(quanteda)
library(tidyverse)
text <- "This is a number: 541 145-8884 also 222-222-2222 Here's (444)111-1111. No. 555 666 7774"
kwic(
x = text,
phrase(c("^[\\d]{10}$","^[\\d]{3} [\\d]{3}-[\\d]{4}$","^[\\d]{3}-[\\d]{3}-[\\d]{4}$","^[\\d]{3} [\\d]{3} [\\d]{4}$","^[(] [\\d]{3} [)] [\\d]{3}-[\\d]{4}$")),
window = 3,
valuetype = "regex",
separator = " ",
case_insensitive = FALSE
) %>%
print()
# Output:
# [text1, 6:7] a number: | 541 145-8884 | also 222-222-2222 Here's
# [text1, 9:9] 541 145-8884 also | 222-222-2222 | Here's( 444
# [text1, 11:14] also 222-222-2222 Here's | ( 444 ) 111-1111 | . No.
# [text1, 18:20] . No. | 555 666 7774 |
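The phrase patterns above are written one token at a time because kwic matches each pattern element against a single token, not against the raw string. A small sketch for inspecting how the text is tokenized, assuming quanteda's default tokens() settings (the variable name toks is just for illustration):

```r
library(quanteda)

# Same sample text as in the kwic call above.
text <- "This is a number: 541 145-8884 also 222-222-2222 Here's (444)111-1111. No. 555 666 7774"

# Tokenize with default settings and print the tokens, so that each
# regex in phrase() can be lined up against the token it must match.
toks <- tokens(text)
print(toks)
```

Comparing the printed tokens with the kwic output shows, for example, that "541 145-8884" spans two tokens, which is why its pattern has two space-separated parts.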