r

Using gsub to extract only capital letters of a certain length


I have a string from which I want to extract the country code. The code always consists of exactly 3 capital letters.

mystring
"Bloggs, Joe GBR London (1)/Bloggs, Joe London (2)" 
"Bloggs, Joe London (1)/Bloggs, Joe  GBR London (2)"  
"Bloggs, Joe London (1)/Bloggs, Joe London (2)" 
"Bloggs, Joe GBR London (1)/Bloggs, Joe GBR London (2)" 
 "Bloggs, J-S GBR London (1)/Bloggs, J-S GBR London (2)" 

What I'm trying to get

mystring
GBR/
/GBR
/
GBR/GBR
GBR/GBR

Blanks are fine where there is no country; I can deal with them.

I've tried a couple of things that I have seen on here. One removed all characters that aren't capitals, but that leaves me with letters I don't want, such as the capitals from the name and location. I then tried something similar, removing everything that doesn't start and end with a capital, which also failed because of the names:

gsub("[^A-Z$]", "", mystring)

Keeping only the runs of exactly 3 capital letters might work, but I can't quite get the code right. I think it would look something like the line below, unless anyone knows a more robust method:

gsub("[^A-Z$]{3}", "", mystring)

Solution

  • I like stringr::str_extract for extracting patterns from strings. This lets you simply enter the pattern you want, rather than trying to replace everything else:

    mystring = c(
      "Bloggs, Joe GBR London (1)/Bloggs, Joe London (2)",
      "Bloggs, Joe London (1)/Bloggs, Joe  GBR London (2)",
      "Bloggs, Joe London (1)/Bloggs, Joe London (2)",
      "Bloggs, Joe GBR London (1)/Bloggs, Joe GBR London (2)",
      "Bloggs, J-S GBR London (1)/Bloggs, J-S GBR London (2)"
    )
    
    ## extract first matches
    stringr::str_extract(mystring, "[A-Z]{3}")
    # [1] "GBR" "GBR" NA    "GBR" "GBR"
    
    ## or get all matches with `str_extract_all`
    stringr::str_extract_all(mystring, "[A-Z]{3}")
    # [[1]]
    # [1] "GBR"
    # 
    # [[2]]
    # [1] "GBR"
    # 
    # [[3]]
    # character(0)
    # 
    # [[4]]
    # [1] "GBR" "GBR"
    # 
    # [[5]]
    # [1] "GBR" "GBR"
    

    The same is possible in base R, using regexpr together with either substring or regmatches, as seen in other answers.
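
As a rough illustration of the base R route, here is a minimal sketch that also builds the `GBR/`, `/GBR` layout asked for in the question. It assumes that a country code is a standalone word of exactly three capitals (hence the added `\\b` word boundaries, which are not in the pattern above) and that the two people are always separated by `/`. The helper `first_code` is a made-up name for this example, not part of any package.

```r
# Example data copied from the question.
mystring <- c(
  "Bloggs, Joe GBR London (1)/Bloggs, Joe London (2)",
  "Bloggs, Joe London (1)/Bloggs, Joe  GBR London (2)",
  "Bloggs, Joe London (1)/Bloggs, Joe London (2)",
  "Bloggs, Joe GBR London (1)/Bloggs, Joe GBR London (2)",
  "Bloggs, J-S GBR London (1)/Bloggs, J-S GBR London (2)"
)

# Assumption: a country code is a standalone word of exactly three capitals,
# so word boundaries are added to skip longer all-caps words.
pattern <- "\\b[A-Z]{3}\\b"

# All matches per string: the base R counterpart of str_extract_all().
all_codes <- regmatches(mystring, gregexpr(pattern, mystring, perl = TRUE))

# Hypothetical helper: first match in each element, "" where there is none.
first_code <- function(x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep("", length(x))
  out[m != -1] <- regmatches(x, m)  # regmatches() drops the non-matches
  out
}

# Split each string on "/", extract from each side, then re-join with "/".
sides <- strsplit(mystring, "/", fixed = TRUE)
codes <- vapply(sides, function(s) paste(first_code(s), collapse = "/"),
                character(1))
codes
# [1] "GBR/"    "/GBR"    "/"       "GBR/GBR" "GBR/GBR"
```

Splitting first keeps track of which side of the `/` each code came from, which `str_extract_all` on the whole string cannot tell you when only one side has a code.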