r, regex, stringr, stringi

regex/stringr: splitting a joined sequence of country names


I have strings that each contain several country names run together. The names are not separated by any delimiter; the only boundary is that a capital letter follows a lowercase letter without a space. Spaces are, however, part of some country names, e.g. Democratic Republic of the Congo.

My stringr/regex attempt is rather close, but I am losing the first letter of the second and subsequent country names. Any help? Many thanks.

library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.6.2
#> Warning: package 'forcats' was built under R version 3.6.3
v <- structure(list(countries = c("Democratic Republic of the CongoSweden", 
                             "DenmarkIran (Islamic Republic of)", "AfghanistanSweden", "AzerbaijanSwedenGermany", 
                             "BangladeshSweden", "DenmarkSri Lanka", "CanadaSri Lanka", "DenmarkNigeria", 
                             "CanadaIreland", "CanadaMexico")), class = c("tbl_df", "tbl", 
                                                                          "data.frame"), row.names = c(NA, -10L))



v %>% 
  mutate(index=row_number()) %>% 
  #mutate(countries_split=str_split(countries, "[A-Z][a-z]*[a-z:space:]+(?=[A-Z])")) %>%
  #mutate(countries_split=str_split(countries, "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+[A-Z][a-z]{1,20}+).")) %>% 
  mutate(countries_split=str_split(countries, "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+)[A-Z]")) %>% 
  unnest(countries_split)
#> # A tibble: 21 x 3
#>    countries                              index countries_split                 
#>    <chr>                                  <int> <chr>                           
#>  1 Democratic Republic of the CongoSweden     1 Democratic Republic of the Congo
#>  2 Democratic Republic of the CongoSweden     1 weden                           
#>  3 DenmarkIran (Islamic Republic of)          2 Denmark                         
#>  4 DenmarkIran (Islamic Republic of)          2 ran (Islamic Republic of)       
#>  5 AfghanistanSweden                          3 Afghanistan                     
#>  6 AfghanistanSweden                          3 weden                           
#>  7 AzerbaijanSwedenGermany                    4 Azerbaijan                      
#>  8 AzerbaijanSwedenGermany                    4 weden                           
#>  9 AzerbaijanSwedenGermany                    4 ermany                          
#> 10 BangladeshSweden                           5 Bangladesh                      
#> # ... with 11 more rows

Created on 2020-03-06 by the reprex package (v0.3.0)


Solution

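  • str_split() discards whatever the pattern matches, so the trailing [A-Z] in the attempt consumed the first letter of every country after the first. Wrapping it in a positive lookahead, (?=[A-Z]), makes the match zero-width: the string is split just before the capital letter and the letter is kept.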

    library(tidyverse)
    
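    v %>%
      mutate(row = row_number(), 
             # Lookbehind: a capitalised word directly precedes the split point.
             # Lookahead: the next character is a capital, but it is not consumed.
             # Note: [a-z:space:] is not the POSIX class [[:space:]]; it appears to
             # match a-z plus the literal characters ':', 's', 'p', 'a', 'c', 'e',
             # which is why names containing spaces are not split.
             countries = str_split(countries, 
                       "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+)(?=[A-Z])")) %>%
      unnest(countries)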
    
    # A tibble: 21 x 2
    #   countries                          row
    #   <chr>                            <int>
    # 1 Democratic Republic of the Congo     1
    # 2 Sweden                               1
    # 3 Denmark                              2
    # 4 Iran (Islamic Republic of)           2
    # 5 Afghanistan                          3
    # 6 Sweden                               3
    # 7 Azerbaijan                           4
    # 8 Sweden                               4
    # 9 Germany                              4
    #10 Bangladesh                           5
    # … with 11 more rows
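
As a possible simplification (a sketch, not part of the original answer): the only boundary signal described in the question is a lowercase letter immediately followed by a capital, so a one-character lookbehind may be enough. The vector below is a subset of the question's data. The second pattern, which also splits after a closing parenthesis, is an assumption about inputs such as "Iran (Islamic Republic of)Sweden", which do not occur in the question.

```r
library(stringr)

# Subset of the question's data, used here only for illustration.
countries <- c("Democratic Republic of the CongoSweden",
               "DenmarkIran (Islamic Republic of)",
               "AzerbaijanSwedenGermany",
               "DenmarkSri Lanka")

# A pattern that consumes the capital letter drops it from the result.
lossy <- str_split("CanadaMexico", "(?<=[a-z])[A-Z]")[[1]]

# Zero-width split: between a lowercase letter (lookbehind) and an
# uppercase letter (lookahead). Nothing is consumed, so nothing is lost.
parts <- str_split(countries, "(?<=[a-z])(?=[A-Z])")

# Hypothetical variant: also treat ")" as a possible end of a country name.
paren_parts <- str_split("Iran (Islamic Republic of)Sweden",
                         "(?<=[a-z)])(?=[A-Z])")[[1]]

print(lossy)
print(parts)
print(paren_parts)
```

Inside mutate() the same pattern can replace the longer one, i.e. str_split(countries, "(?<=[a-z])(?=[A-Z])"). Neither version can separate names that legitimately contain a lowercase letter directly followed by a capital, so the result is worth checking against a list of known country names.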