I have a string that contains several country names run together. The names are not separated by any delimiter; the only pattern is that a capital letter follows a lowercase letter without a space. (Spaces are, however, part of some country names, e.g. Democratic Republic of the Congo.)
My stringr/regex attempt is close, but I am losing the first letter of the second and subsequent country names. Any help? Many thanks.
library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.6.2
#> Warning: package 'forcats' was built under R version 3.6.3
v <- structure(list(countries = c("Democratic Republic of the CongoSweden",
"DenmarkIran (Islamic Republic of)", "AfghanistanSweden", "AzerbaijanSwedenGermany",
"BangladeshSweden", "DenmarkSri Lanka", "CanadaSri Lanka", "DenmarkNigeria",
"CanadaIreland", "CanadaMexico")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L))
v %>%
  mutate(index = row_number()) %>%
  # mutate(countries_split = str_split(countries, "[A-Z][a-z]*[a-z:space:]+(?=[A-Z])")) %>%
  # mutate(countries_split = str_split(countries, "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+[A-Z][a-z]{1,20}+).")) %>%
  mutate(countries_split = str_split(countries, "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+)[A-Z]")) %>%
  unnest(countries_split)
#> # A tibble: 21 x 3
#> countries index countries_split
#> <chr> <int> <chr>
#> 1 Democratic Republic of the CongoSweden 1 Democratic Republic of the Congo
#> 2 Democratic Republic of the CongoSweden 1 weden
#> 3 DenmarkIran (Islamic Republic of) 2 Denmark
#> 4 DenmarkIran (Islamic Republic of) 2 ran (Islamic Republic of)
#> 5 AfghanistanSweden 3 Afghanistan
#> 6 AfghanistanSweden 3 weden
#> 7 AzerbaijanSwedenGermany 4 Azerbaijan
#> 8 AzerbaijanSwedenGermany 4 weden
#> 9 AzerbaijanSwedenGermany 4 ermany
#> 10 BangladeshSweden 5 Bangladesh
#> # ... with 11 more rows
Created on 2020-03-06 by the reprex package (v0.3.0)
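Your pattern ends in [A-Z], so the capital letter is consumed as part of the delimiter and dropped from the result. Wrapping it in a positive lookahead, (?=[A-Z]), only asserts that a capital letter follows. The split then happens at a zero-width position and the letter stays with the second country name.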
library(tidyverse)
v %>%
  mutate(row = row_number(),
         countries = str_split(countries,
                               "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+)(?=[A-Z])")) %>%
  unnest(countries)
# A tibble: 21 x 2
# countries row
# <chr> <int>
# 1 Democratic Republic of the Congo 1
# 2 Sweden 1
# 3 Denmark 2
# 4 Iran (Islamic Republic of) 2
# 5 Afghanistan 3
# 6 Sweden 3
# 7 Azerbaijan 4
# 8 Sweden 4
# 9 Germany 4
#10 Bangladesh 5
# … with 11 more rows
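To see the difference between a consuming match and a zero-width match in isolation, here is a minimal sketch on a plain character vector. The shortened lookbehind (?<=[a-z]) is my own simplification for illustration, not the pattern used above. It assumes every boundary is a lowercase letter immediately followed by a capital, so it would not split after a name ending in ")" and it would wrongly split a name with an internal capital.

```r
library(stringr)

x <- c("AzerbaijanSwedenGermany", "DenmarkIran (Islamic Republic of)")

# Consuming pattern: the capital letter is part of the delimiter,
# so str_split() removes it from the pieces.
consumed <- str_split(x, "(?<=[a-z])[A-Z]")
consumed[[1]]
#> [1] "Azerbaijan" "weden"      "ermany"

# Zero-width pattern: lookbehind and lookahead only assert the context,
# so nothing is removed at the split position.
kept <- str_split(x, "(?<=[a-z])(?=[A-Z])")
kept[[1]]
#> [1] "Azerbaijan" "Sweden"     "Germany"
kept[[2]]
#> [1] "Denmark"                    "Iran (Islamic Republic of)"
```

The second element is not split inside "Iran (Islamic Republic of)" because each capital there is preceded by a space or "(", never by a lowercase letter.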