I've been trying to use str_extract to extract dates from data I've scraped from the website of the World Trade Organization. The problem is that, for whatever reason, it always returns NA. However, when I type in the strings myself, the function works. Any ideas as to what is going on?
> country_comparison$status[1:10]
[1] "Settled or terminated (withdrawn, mutually agreed solution) on 29 March 1995" "Implementation notified by respondent on 25 September 1997"
[3] "In consultations on 4 April 1995" "Implementation notified by respondent on 25 September 1997"
[5] "Settled or terminated (withdrawn, mutually agreed solution) on 20 July 1995" "Settled or terminated (withdrawn, mutually agreed solution) on 19 July 1995"
[7] "Settled or terminated (withdrawn, mutually agreed solution) on 5 July 1996" "Mutually acceptable solution on implementation notified on 9 January 1998"
[9] "Panel established, but not yet composed on 11 October 1995" "Mutually acceptable solution on implementation notified on 9 January 1998"
> country_comparison$status[1:10] %>% str_extract(pattern = "[0-9]{1,2} [A-Za-z]+ [0-9]{4}")
[1] NA NA NA NA NA NA NA NA NA NA
> c("Settled or terminated (withdrawn, mutually agreed solution) on 29 March 1995", "Implementation notified by respondent on 25 September 1997") %>% str_extract(pattern = "[0-9]{1,2} [A-Za-z]+ [0-9]{4}")
[1] "29 March 1995" "25 September 1997"
Kind of a guess, but if those strings were scraped from www.wto.org and the first one originates from https://www.wto.org/english/tratop_e/dispu_e/cases_e/ds1_e.htm , then depending on how they were collected, they might contain a few non-breaking spaces:
<span class="paraboldcolourtext">
Settled or terminated (withdrawn, mutually agreed solution)
</span> on <b>29 March 1995</b>
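One way to check this guess on your own data is to look at the code points of a suspicious string. This is only a sketch using base R: `x` is a made-up example string, not the actual scraped value. A regular space has code point 32 and a non-breaking space has 160.

```r
# Hypothetical example string; \u00A0 is a non-breaking space
x <- "on 29\u00A0March 1995"

# Code point of each character: a regular space is 32, a non-breaking space is 160
codes <- utf8ToInt(x)

# Positions of any non-breaking spaces in the string
nbsp_pos <- which(codes == 160L)
nbsp_pos
```

If `nbsp_pos` is non-empty for one of your scraped strings, a pattern with a literal space will not match at those positions.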
Try replacing " " (space) in the regex with \\s to match any whitespace:
library(stringr)
s <- "Settled or terminated (withdrawn, mutually agreed solution) on 29\u00A0March\u00A01995"
# looks like a regular space:
s
#> [1] "Settled or terminated (withdrawn, mutually agreed solution) on 29 March 1995"
# until you check it with something that can highlight unusual whitespace:
stringr::str_view(s)
#> [1] │ Settled or terminated (withdrawn, mutually agreed solution) on 29{\u00a0}March{\u00a0}1995
# replacing " " in regex with \\s:
str_view(s,"[0-9]{1,2}\\s[A-Za-z]+\\s[0-9]{4}")
#> [1] │ Settled or terminated (withdrawn, mutually agreed solution) on <29{\u00a0}March{\u00a0}1995>
str_extract(s,"[0-9]{1,2}\\s[A-Za-z]+\\s[0-9]{4}")
#> [1] "29 March 1995"
Created on 2023-09-23 with reprex v2.0.2
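If you would rather keep the original pattern with literal spaces, another option is to normalise the strings first. The following is a minimal sketch, assuming the only unusual whitespace is U+00A0; `status` is a hypothetical stand-in for `country_comparison$status`.

```r
library(stringr)

# Hypothetical stand-in for the scraped column; \u00A0 is a non-breaking space
status <- c(
  "In consultations on 4\u00A0April\u00A01995",
  "Panel established, but not yet composed on 11 October 1995"
)

# Replace every non-breaking space with a regular space
status_clean <- str_replace_all(status, fixed("\u00A0"), " ")

# The original pattern with literal spaces can now match
dates <- str_extract(status_clean, "[0-9]{1,2} [A-Za-z]+ [0-9]{4}")
dates
```

Cleaning the column once may be preferable if you run several patterns over it; switching to `\\s` is the smaller change if you only need this one extraction.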