I have a string from which I want to extract the country code, which is always three capital letters.
mystring
"Bloggs, Joe GBR London (1)/Bloggs, Joe London (2)"
"Bloggs, Joe London (1)/Bloggs, Joe GBR London (2)"
"Bloggs, Joe London (1)/Bloggs, Joe London (2)"
"Bloggs, Joe GBR London (1)/Bloggs, Joe GBR London (2)"
"Bloggs, J-S GBR London (1)/Bloggs, J-S GBR London (2)"
What I'm trying to get
mystring
GBR/
/GBR
/
GBR/GBR
GBR/GBR
Blanks are fine if there is no country; I can deal with them.
I've tried a couple of things I have seen on here. One removed all characters that aren't capital letters, but that leaves letters I don't want, such as the capitals from the name and location. I then tried something similar, removing everything that doesn't start and end with a capital (also no joy, because of the names):
gsub("[^A-Z$]", "", mystring)
Keeping only runs of exactly three capital letters might work, but I can't quite get the code right. I think it would look something like the line below, unless anyone knows a more robust method:
gsub("[^A-Z$]{3}", "", mystring)
I like stringr::str_extract for extracting patterns from strings. This lets you specify the pattern you want to keep, rather than trying to replace everything else:
mystring = c("Bloggs, Joe GBR London (1)/Bloggs, Joe London (2)",
"Bloggs, Joe London (1)/Bloggs, Joe GBR London (2)" ,
"Bloggs, Joe London (1)/Bloggs, Joe London (2)" ,
"Bloggs, Joe GBR London (1)/Bloggs, Joe GBR London (2)",
"Bloggs, J-S GBR London (1)/Bloggs, J-S GBR London (2)"
)
## extract first matches
stringr::str_extract(mystring, "[A-Z]{3}")
# [1] "GBR" "GBR" NA "GBR" "GBR"
## or get all matches with `str_extract_all`
stringr::str_extract_all(mystring, "[A-Z]{3}")
# [[1]]
# [1] "GBR"
#
# [[2]]
# [1] "GBR"
#
# [[3]]
# character(0)
#
# [[4]]
# [1] "GBR" "GBR"
#
# [[5]]
# [1] "GBR" "GBR"
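To get the slash-separated format asked for in the question, one option is to split each string on "/" and extract from each part separately. The following is only a sketch: `extract_codes` is a hypothetical helper name, and the `\\b` word boundaries are an added assumption so that the first three letters of a longer upper-case word (say, a name written in all capitals) are not mistaken for a country code.

```r
library(stringr)

mystring <- c(
  "Bloggs, Joe GBR London (1)/Bloggs, Joe London (2)",
  "Bloggs, Joe London (1)/Bloggs, Joe GBR London (2)",
  "Bloggs, Joe London (1)/Bloggs, Joe London (2)",
  "Bloggs, Joe GBR London (1)/Bloggs, Joe GBR London (2)",
  "Bloggs, J-S GBR London (1)/Bloggs, J-S GBR London (2)"
)

# Hypothetical helper: returns one code (or a blank) per "/"-separated part
extract_codes <- function(x) {
  parts <- str_split(x, fixed("/"))
  vapply(parts, function(p) {
    # \\b...\\b keeps only stand-alone runs of exactly three capitals
    codes <- str_extract(p, "\\b[A-Z]{3}\\b")
    codes[is.na(codes)] <- ""          # blank when a part has no code
    paste(codes, collapse = "/")
  }, character(1))
}

extract_codes(mystring)
# [1] "GBR/"    "/GBR"    "/"       "GBR/GBR" "GBR/GBR"
```

Splitting first keeps the position of each code, which `str_extract_all` alone loses when only one side of the "/" has a country.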
It is possible to do the same in base R using substring or regmatches and regexpr, as seen in answers here.
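As a rough base R sketch of the same idea (not taken from the linked answers): `regexpr` locates only the first match in each string, while `gregexpr` locates all of them, and `regmatches` pulls out the matched text. The vector `x` is a shortened copy of the example data, and the `\\b` word boundaries are an added assumption.

```r
x <- c(
  "Bloggs, Joe GBR London (1)/Bloggs, Joe London (2)",
  "Bloggs, Joe London (1)/Bloggs, Joe London (2)",
  "Bloggs, Joe GBR London (1)/Bloggs, Joe GBR London (2)"
)

pattern <- "\\b[A-Z]{3}\\b"

# All matches: a list with one character vector per input string
all_codes <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# First match only: strings without a match are dropped from the result
first_codes <- regmatches(x, regexpr(pattern, x, perl = TRUE))
```

Note that the `regexpr` version returns a shorter vector when some strings have no match, so it does not line up with the input the way `stringr::str_extract` (which returns `NA`) does.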