I am trying to extract names from a list of documents. The name is always the first occurrence of the "last, first" pattern.
I am trying the following regex with stringr, but it does not work: ^[A-Z][a-z]+,\s[A-Z][a-z]+$ I believe this is because the text before the name is not constant across the documents. Please see the example below.
library(stringr)
m = c(" name: aaaaaa, bbbbbb age: 25",
      "age 34 person: aaaa, bbbb",
      " location: A name aaaa, bbbbbbb",
      "aaaaa, bbbb")
str_extract(m, "^[A-Z][a-z]+,\\s[A-Z][a-z]+$")
# I tried to add a white space before and after the beginning of the pattern
# but still not working:
str_extract(m, "^\\s[A-Z][a-z]+,\\s[A-Z][a-z]+$\\s")
The expected output is the list of names: "aaaaaa, bbbbbb", "aaaa, bbbb", "aaaa, bbbbbbb", "aaaaa, bbbb"
Appreciate your suggestions.
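A quick way to see why both attempts return NA is to compare the anchored pattern with an unanchored one. The sketch below assumes the sample strings are as given in the question (with the third one on a single line); the names anchored and relaxed are only illustrative. The anchors ^ and $ force the whole string to be a name, and [A-Z] requires an uppercase initial, while the sample names are all lowercase.

```r
library(stringr)

# Sample strings from the question (third element joined onto one line).
m <- c(" name: aaaaaa, bbbbbb age: 25",
       "age 34 person: aaaa, bbbb",
       " location: A name aaaa, bbbbbbb",
       "aaaaa, bbbb")

# Pattern from the question: anchored at both ends and expecting an
# uppercase initial, so none of the four elements can match.
anchored <- str_extract(m, "^[A-Z][a-z]+,\\s[A-Z][a-z]+$")

# Same shape without anchors and with either letter case allowed:
# str_extract() returns the first "last, first" pair in each element.
relaxed <- str_extract(m, "[A-Za-z]+,\\s[A-Za-z]+")

print(anchored)
print(relaxed)
```

Dropping the anchors is what lets the match start anywhere in the string, so the varying text before the name no longer matters.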
# Base example
m = c(" name: aaaaaa, bbbbbb age: 25" ,
"age 34 person: aaaa, bbbb",
" location: A name aaaa, bbbbbbb")
# This function implements the solution
extract_lastfirst <- function(x) {
  stopifnot(`"{stringr} is required"` = requireNamespace("stringr"))
  stringr::str_extract(x, "\\w+, \\w+") # This line solves the problem
}
extract_lastfirst(m)
#> [1] "aaaaaa, bbbbbb" "aaaa, bbbb" "aaaa, bbbbbbb"
# The question mentions "the first occurrence of", so try the solution
# on an example that has a second occurrence.
n <- c("name: aa, bb wrong: cc, dd")
extract_lastfirst(n)
#> [1] "aa, bb"
# formal tests for the solution ---------------------------------------
# (no output means test passed)
library(testthat)
testthat::test_that("goal achieved", {
  expected_out_m <- c("aaaaaa, bbbbbb", "aaaa, bbbb", "aaaa, bbbbbbb")
  expect_equal(extract_lastfirst(m), expected_out_m)
})
testthat::test_that("multiple occurrences", {
  expected_out_n <- c("aa, bb")
  expect_equal(extract_lastfirst(n), expected_out_n)
})
Created on 2020-09-02 by the reprex package (v0.3.0)
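If depending on stringr is not desirable, roughly the same extraction can be sketched in base R with regexpr() and regmatches(). This is only an illustration: the function name extract_lastfirst_base is hypothetical, and the letters-only class [[:alpha:]] is an assumption that names contain no digits or underscores (which \w would also accept).

```r
# Hypothetical base R counterpart of extract_lastfirst();
# not part of the original reprex.
extract_lastfirst_base <- function(x) {
  # Assumption: names consist of letters only, so [[:alpha:]] replaces \\w.
  pattern <- "[[:alpha:]]+, [[:alpha:]]+"
  hit <- regexpr(pattern, x)               # first match per element, -1 if none
  found <- !is.na(hit) & hit != -1         # elements that actually matched
  out <- rep(NA_character_, length(x))     # NA where nothing matches
  out[found] <- regmatches(x, hit)         # fill in the matched substrings
  out
}

res <- extract_lastfirst_base(c("name: aa, bb wrong: cc, dd", "no name here"))
print(res)
```

Because regexpr() reports only the first match in each element, the "first occurrence" requirement is met without extra work, and elements without a match come back as NA, as they do with str_extract().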
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os Ubuntu 20.04.1 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Rome
#> date 2020-09-02
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> backports 1.1.9 2020-08-24 [1] CRAN (R 4.0.2)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 4.0.2)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.2)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2)
#> devtools 2.3.1 2020-07-21 [1] CRAN (R 4.0.2)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
#> knitr 1.29 2020-06-23 [1] CRAN (R 4.0.2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.2)
#> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.2)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
#> processx 3.4.3 2020-07-05 [1] CRAN (R 4.0.2)
#> ps 1.3.4 2020-08-11 [1] CRAN (R 4.0.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.2)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 4.0.2)
#> rmarkdown 2.3 2020-06-18 [1] CRAN (R 4.0.2)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> testthat * 2.3.2 2020-03-02 [1] CRAN (R 4.0.2)
#> usethis 1.6.1 2020-04-29 [1] CRAN (R 4.0.2)
#> withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.2)
#> xfun 0.16 2020-07-24 [1] CRAN (R 4.0.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/cl/R/x86_64-pc-linux-gnu-library/4.0
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library