Search code examples
rdataframesubstringsubsetextract

How to extract capital letters from string before number in R


I have a dataframe with strings in a column. How can I extract only uppercase substrings before a number and add them to another column? An example is for DE, but there are many more country abbreviations and they always appear before a number.

TD<-data.frame(a=c("WHATEVERDE 11111","","Whatever DE 11111","DE 11111",""), 
           b=c("","What DE EverDE 1111","","",""),
           c=c("Whatever","","","","WhateverDE 11111"))

And I would like to create another column as follows:

> TD
                  a                   b                c     result
1  WHATEVERDE 11111                             Whatever         DE
2                   What DE EverDE 1111                          DE
3 Whatever DE 11111                                              DE
4          DE 11111                                              DE
5                                       WhateverDE 11111         DE

I tried to apply the solution:

sub("^([[:alpha:]]*).*", "\\1", "DE 11111") but is not universal.

Vector with abbreviations:

names<-c('AT','BE','DE','BG','CZ','DK','FR','GR','ES','NL','HU','GB','IT')

Solution

  • We loop across the columns, extract the 2 letter uppercase countrycode substring that precedes zero or more spaces and one or more digits, coalesce the output so that it returns the first non-NA extracted element per row

    library(dplyr)
    library(stringr)
    library(purrr)
    library(countrycode)
    pat <- countrycode::codelist %>%
           pull(iso2c) %>% 
           na.omit %>% 
           str_c(collapse = "|") %>% 
           sprintf(fmt = "(%s)(?=\\s*\\d+)")
    
    TD %>% 
       mutate(result = invoke(coalesce, 
         across(everything(), ~ str_extract(., pat))))
    

    -output

                      a                   b                c result
    1  WHATEVERDE 11111                             Whatever     DE
    2                   What DE EverDE 1111                      DE
    3 Whatever DE 11111                                          DE
    4          DE 11111                                          DE
    5                                       WhateverDE 11111     DE