
Regex for "last, first" names in R


I am trying to extract names from a list of documents. The name is always the first occurrence of the "last, first" pattern.

I am trying the regex ^[A-Z][a-z]+,\s[A-Z][a-z]+$ with stringr, but it does not work. I believe this is because the text preceding the name is not constant across the documents. Please see the example below.


library(stringr)
m = c("   name: aaaaaa, bbbbbb  age: 25" , "age 34   person: aaaa, bbbb", " location: A  name 
 aaaa, bbbbbbb", "aaaaa, bbbb")

str_extract(m, "^[A-Z][a-z]+,\\s[A-Z][a-z]+$")

# I also tried allowing a whitespace character before and after the pattern,
# but it still does not work:

str_extract(m, "^\\s[A-Z][a-z]+,\\s[A-Z][a-z]+$\\s")


The expected output is the list of names:

aaaaaa, bbbbbb
aaaa, bbbb
aaaa, bbbbbbb
aaaaa, bbbb

Appreciate your suggestions.


Solution

    # Base example
    m = c("   name: aaaaaa, bbbbbb  age: 25" ,
          "age 34   person: aaaa, bbbb",
          " location: A  name aaaa, bbbbbbb")
    
    
    # This function implements the solution
    extract_lastfirst <- function(x) {
      stopifnot(`"{stringr} is required"` = requireNamespace("stringr"))
    
      stringr::str_extract(x, "\\w+, \\w+") # This line solves the problem
    }
    extract_lastfirst(m)
    #> [1] "aaaaaa, bbbbbb" "aaaa, bbbb"     "aaaa, bbbbbbb"
    
    # The question mentions "the first occurrence of", so try
    # the solution on an example that has a second occurrence.
    n <- c("name: aa, bb wrong: cc, dd")
    extract_lastfirst(n)
    #> [1] "aa, bb"
    
    
    
    
    # formal tests for the solution ---------------------------------------
    # (no output means test passed)
    
    library(testthat)
    
    testthat::test_that("goal achieved", {
      expected_out_m <- c("aaaaaa, bbbbbb", "aaaa, bbbb", "aaaa, bbbbbbb")
      expect_equal(extract_lastfirst(m), expected_out_m)
    })
    
    testthat::test_that("multiple occurrences", {
      expected_out_n <- c("aa, bb")
      expect_equal(extract_lastfirst(n), expected_out_n)
    })
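
For context, here is a small sketch of why the pattern from the question returns NA for every element, and of one possible tightening of the answer's pattern. The anchors ^ and $ force the whole string to be the name, and [A-Z] requires capital initials that the sample data does not have. Since \w also matches digits and underscores, a letters-only variant may be safer if the documents contain numbers followed by a comma. The names anchored, letters_only, and tricky are made up for this illustration, and the tricky input is invented, not taken from the question.

```r
library(stringr)

# Same inputs as in the question (the fourth element has no prefix at all).
m <- c("   name: aaaaaa, bbbbbb  age: 25",
       "age 34   person: aaaa, bbbb",
       " location: A  name aaaa, bbbbbbb",
       "aaaaa, bbbb")

# Pattern from the question: anchored at both ends, capital initials required.
anchored <- "^[A-Z][a-z]+,\\s[A-Z][a-z]+$"
anchored_result <- str_extract(m, anchored)   # NA for every element

# Assumed alternative: no anchors, letters only on both sides of the comma.
letters_only <- "[[:alpha:]]+,\\s[[:alpha:]]+"
names_found <- str_extract(m, letters_only)

# Invented input where "\\w" and the letters-only class disagree.
tricky <- "id: 42, x7  name: aa, bb"
word_match   <- str_extract(tricky, "\\w+, \\w+")   # picks up "42, x7"
letter_match <- str_extract(tricky, letters_only)   # picks up "aa, bb"

# str_extract() keeps only the first match per string;
# str_extract_all() returns every match, as a list with one entry per input.
n <- "name: aa, bb wrong: cc, dd"
all_matches <- str_extract_all(n, letters_only)[[1]]
```

Whether the letters-only class is the right choice depends on the real documents: hyphenated or multi-word surnames would need a wider character class than the one assumed here.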
    

    Created on 2020-09-02 by the reprex package (v0.3.0)

    devtools::session_info()
    #> ─ Session info ───────────────────────────────────────────────────────────────
    #>  setting  value                       
    #>  version  R version 4.0.2 (2020-06-22)
    #>  os       Ubuntu 20.04.1 LTS          
    #>  system   x86_64, linux-gnu           
    #>  ui       X11                         
    #>  language (EN)                        
    #>  collate  en_US.UTF-8                 
    #>  ctype    en_US.UTF-8                 
    #>  tz       Europe/Rome                 
    #>  date     2020-09-02                  
    #> 
    #> ─ Packages ───────────────────────────────────────────────────────────────────
    #>  package     * version date       lib source        
    #>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.2)
    #>  backports     1.1.9   2020-08-24 [1] CRAN (R 4.0.2)
    #>  callr         3.4.3   2020-03-28 [1] CRAN (R 4.0.2)
    #>  cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.2)
    #>  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.2)
    #>  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.2)
    #>  devtools      2.3.1   2020-07-21 [1] CRAN (R 4.0.2)
    #>  digest        0.6.25  2020-02-23 [1] CRAN (R 4.0.2)
    #>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
    #>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.2)
    #>  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.2)
    #>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
    #>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
    #>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.2)
    #>  htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.2)
    #>  knitr         1.29    2020-06-23 [1] CRAN (R 4.0.2)
    #>  magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.2)
    #>  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.2)
    #>  pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.2)
    #>  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.2)
    #>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.2)
    #>  processx      3.4.3   2020-07-05 [1] CRAN (R 4.0.2)
    #>  ps            1.3.4   2020-08-11 [1] CRAN (R 4.0.2)
    #>  R6            2.4.1   2019-11-12 [1] CRAN (R 4.0.2)
    #>  remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
    #>  rlang         0.4.7   2020-07-09 [1] CRAN (R 4.0.2)
    #>  rmarkdown     2.3     2020-06-18 [1] CRAN (R 4.0.2)
    #>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 4.0.2)
    #>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
    #>  stringi       1.4.6   2020-02-17 [1] CRAN (R 4.0.2)
    #>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
    #>  testthat    * 2.3.2   2020-03-02 [1] CRAN (R 4.0.2)
    #>  usethis       1.6.1   2020-04-29 [1] CRAN (R 4.0.2)
    #>  withr         2.2.0   2020-04-20 [1] CRAN (R 4.0.2)
    #>  xfun          0.16    2020-07-24 [1] CRAN (R 4.0.2)
    #>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)
    #> 
    #> [1] /home/cl/R/x86_64-pc-linux-gnu-library/4.0
    #> [2] /usr/local/lib/R/site-library
    #> [3] /usr/lib/R/site-library
    #> [4] /usr/lib/R/library