
Why does stringr::str_extract always return NA for a certain character vector?


I've been trying to use str_extract to extract dates from data I scraped from the website of the World Trade Organization. For some reason it always returns NA, yet when I type the same strings in by hand, the function works. Any ideas as to what is going on?

> country_comparison$status[1:10]
 [1] "Settled or terminated (withdrawn, mutually agreed solution) on 29 March 1995" "Implementation notified by respondent on 25 September 1997"                  
 [3] "In consultations on 4 April 1995"                                             "Implementation notified by respondent on 25 September 1997"                  
 [5] "Settled or terminated (withdrawn, mutually agreed solution) on 20 July 1995"  "Settled or terminated (withdrawn, mutually agreed solution) on 19 July 1995" 
 [7] "Settled or terminated (withdrawn, mutually agreed solution) on 5 July 1996"   "Mutually acceptable solution on implementation notified on 9 January 1998"   
 [9] "Panel established, but not yet composed on 11 October 1995"                   "Mutually acceptable solution on implementation notified on 9 January 1998"   

> country_comparison$status[1:10] %>% str_extract(pattern = "[0-9]{1,2} [A-Za-z]+ [0-9]{4}")
 [1] NA NA NA NA NA NA NA NA NA NA

> c("Settled or terminated (withdrawn, mutually agreed solution) on 29 March 1995", "Implementation notified by respondent on 25 September 1997") %>% str_extract(pattern = "[0-9]{1,2} [A-Za-z]+ [0-9]{4}")
[1] "29 March 1995"     "25 September 1997"

Solution

  • Kind of a guess, but if those strings were scraped from www.wto.org and the first one originates from https://www.wto.org/english/tratop_e/dispu_e/cases_e/ds1_e.htm , then depending on how they were collected, there might be a few non-breaking spaces:

    <span class="paraboldcolourtext">
        Settled or terminated (withdrawn, mutually agreed solution)
    </span> on <b>29&nbsp;March&nbsp;1995</b>
    

    Try replacing " " (space) in the regex with \\s to match any whitespace, including non-breaking spaces:

    library(stringr)
    s <- "Settled or terminated (withdrawn, mutually agreed solution) on 29\u00A0March\u00A01995"
    # looks like a regular space:
    s
    #> [1] "Settled or terminated (withdrawn, mutually agreed solution) on 29 March 1995"
    
    # until you check it with something that can highlight unusual whitespace:
    stringr::str_view(s)
    #> [1] │ Settled or terminated (withdrawn, mutually agreed solution) on 29{\u00a0}March{\u00a0}1995
    
    # replacing " " in regex with \\s:
    str_view(s,"[0-9]{1,2}\\s[A-Za-z]+\\s[0-9]{4}")
    #> [1] │ Settled or terminated (withdrawn, mutually agreed solution) on <29{\u00a0}March{\u00a0}1995>
    str_extract(s,"[0-9]{1,2}\\s[A-Za-z]+\\s[0-9]{4}")
    #> [1] "29 March 1995"
    

    Created on 2023-09-23 with reprex v2.0.2
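
    If you would rather keep the original pattern with literal spaces, another option is to check for non-breaking spaces and normalise them to regular spaces before extracting. This is only a minimal sketch: the `status` vector below is made-up sample data standing in for `country_comparison$status`, and it assumes U+00A0 is the only unusual whitespace character involved.

    ```r
    library(stringr)

    # Hypothetical sample: the first string uses non-breaking spaces (U+00A0),
    # the second uses regular spaces.
    status <- c(
      "In consultations on 4\u00A0April\u00A01995",
      "Panel established, but not yet composed on 11 October 1995"
    )

    # Flag the elements that contain a non-breaking space.
    has_nbsp <- str_detect(status, fixed("\u00A0"))
    has_nbsp
    #> [1]  TRUE FALSE

    # Replace non-breaking spaces with regular spaces, then reuse the
    # original pattern unchanged.
    status_clean <- str_replace_all(status, fixed("\u00A0"), " ")
    dates <- str_extract(status_clean, "[0-9]{1,2} [A-Za-z]+ [0-9]{4}")
    dates
    #> [1] "4 April 1995"    "11 October 1995"
    ```

    Cleaning the column once like this also means the extracted dates contain regular spaces, which can matter if you later parse or compare them.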