I have a string that looks like this:
clean_text
[1] "01/04/2018 Japan - Ghana 7:1 04/04/2018 Turkey - Estonia 3:2 06/04/2018 USA - Mexico 4:1 France - Nigeria 8:0 07/04/2018 Turkey - Estonia 3:0 08/04/2018 USA - Mexico 6:2 09/04/2018 France - Canada 1:0 10/04/2018 Cuba - Nicaragua 4:2 12/04/2018 Cuba - Nicaragua 1:2 18/04/2018 St. Vincent/Grenadines - St. Lucia 0:1 St. Kitts & Nevis - Dominica 1:0 Cuba - Barbados 7:0 19/04/2018 Haiti - Virgin Islands 7:0 20/04/2018 St. Lucia - Dominica 0:0 St. Kitts & Nevis - St. Vincent/Grenadines 2:0 Jamaica - Barbados 3:2 21/04/2018 Virgin Islands - Haiti 0:14 22/04/2018 Dominica - St. Vincent/Grenadines 3:0 St. Kitts & Nevis - St. Lucia 0:1 Jamaica - Cuba 0:1 25/04/2018 Guyana - Grenada 0:0 Trinidad & Tobago - Suriname 7:0 27/04/2018 Suriname - Guyana 2:2 Antigua & Barbuda - Curaçao 2:1 Trinidad & Tobago - Grenada 8:1 29/04/2018 Grenada - Suriname 5:6 Trinidad & Tobago - Guyana 3:1 "
I want to preprocess it so that I get a list of the team names, such as Japan, Ghana, Turkey, Estonia, USA, and so on, that is, the names that are separated by ' - '.
I am trying the code:
library(stringr)
pattern <- "[[:alpha:]][[:alpha:] -]*[[:alpha:]]"
matches <- str_extract_all(clean_text, pattern)[[1]]
which gives me this result:
[1] "Japan - Ghana" "Turkey - Estonia"
[3] "USA - Mexico" "France - Nigeria"
[5] "Turkey - Estonia" "USA - Mexico"
[7] "France - Canada" "Cuba - Nicaragua"
[9] "Cuba - Nicaragua" "St"
[11] "Vincent" "Grenadines - St"
[13] "Lucia" "St"
[15] "Kitts" "Nevis - Dominica"
[17] "Cuba - Barbados" "Haiti - Virgin Islands"
[19] "St" "Lucia - Dominica"
[21] "St" "Kitts"
[23] "Nevis - St" "Vincent"
[25] "Grenadines" "Jamaica - Barbados"
[27] "Virgin Islands - Haiti" "Dominica - St"
[29] "Vincent" "Grenadines"
[31] "St" "Kitts"
[33] "Nevis - St" "Lucia"
[35] "Jamaica - Cuba" "Guyana - Grenada"
[37] "Trinidad" "Tobago - Suriname"
[39] "Suriname - Guyana" "Antigua"
[41] "Barbuda - Curaçao" "Trinidad"
[43] "Tobago - Grenada" "Grenada - Suriname"
[45] "Trinidad" "Tobago - Guyana"
This is wrong because it breaks the team names wherever '.', '&', or '/' is present. I only want the string to be split where ' - ' is present. What change should I make to my code?
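Perhaps a stepwise approach is more helpful here: split the string on the scores, then remove the date from each piece and trim the whitespace. Note that str_split_1() requires stringr 1.5.0 or later, and that the trailing empty element in the output appears because the string ends with a score.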
library(stringr)
s <- "01/04/2018 Japan - Ghana 7:1 04/04/2018 Turkey - Estonia 3:2 06/04/2018 USA - Mexico 4:1 France - Nigeria 8:0 07/04/2018 Turkey - Estonia 3:0 08/04/2018 USA - Mexico 6:2 09/04/2018 France - Canada 1:0 10/04/2018 Cuba - Nicaragua 4:2 12/04/2018 Cuba - Nicaragua 1:2 18/04/2018 St. Vincent/Grenadines - St. Lucia 0:1 St. Kitts & Nevis - Dominica 1:0 Cuba - Barbados 7:0 19/04/2018 Haiti - Virgin Islands 7:0 20/04/2018 St. Lucia - Dominica 0:0 St. Kitts & Nevis - St. Vincent/Grenadines 2:0 Jamaica - Barbados 3:2 21/04/2018 Virgin Islands - Haiti 0:14 22/04/2018 Dominica - St. Vincent/Grenadines 3:0 St. Kitts & Nevis - St. Lucia 0:1 Jamaica - Cuba 0:1 25/04/2018 Guyana - Grenada 0:0 Trinidad & Tobago - Suriname 7:0 27/04/2018 Suriname - Guyana 2:2 Antigua & Barbuda - Curaçao 2:1 Trinidad & Tobago - Grenada 8:1 29/04/2018 Grenada - Suriname 5:6 Trinidad & Tobago - Guyana 3:1"
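s |>
  str_split_1("\\d+:\\d+") |>            # split on the scores, e.g. "7:1"
  str_remove("\\d{2}/\\d{2}/\\d{4}") |>  # remove a date such as "01/04/2018"
  str_trim()                             # strip surrounding whitespace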
#> [1] "Japan - Ghana"
#> [2] "Turkey - Estonia"
#> [3] "USA - Mexico"
#> [4] "France - Nigeria"
#> [5] "Turkey - Estonia"
#> [6] "USA - Mexico"
#> [7] "France - Canada"
#> [8] "Cuba - Nicaragua"
#> [9] "Cuba - Nicaragua"
#> [10] "St. Vincent/Grenadines - St. Lucia"
#> [11] "St. Kitts & Nevis - Dominica"
#> [12] "Cuba - Barbados"
#> [13] "Haiti - Virgin Islands"
#> [14] "St. Lucia - Dominica"
#> [15] "St. Kitts & Nevis - St. Vincent/Grenadines"
#> [16] "Jamaica - Barbados"
#> [17] "Virgin Islands - Haiti"
#> [18] "Dominica - St. Vincent/Grenadines"
#> [19] "St. Kitts & Nevis - St. Lucia"
#> [20] "Jamaica - Cuba"
#> [21] "Guyana - Grenada"
#> [22] "Trinidad & Tobago - Suriname"
#> [23] "Suriname - Guyana"
#> [24] "Antigua & Barbuda - Curaçao"
#> [25] "Trinidad & Tobago - Grenada"
#> [26] "Grenada - Suriname"
#> [27] "Trinidad & Tobago - Guyana"
#> [28] ""
Created on 2023-03-17 with reprex v2.0.2
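If individual team names are needed rather than pairings, the result above can be taken one step further. The following is a minimal sketch, not a definitive solution: it assumes that every pairing uses exactly ' - ' (space, hyphen, space) as its separator and that no team name contains that sequence. The shortened input and the names `matches` and `teams` are only illustrative.

```r
library(stringr)

# Shortened sample of the input string from the question (illustrative only)
s <- "01/04/2018 Japan - Ghana 7:1 18/04/2018 St. Vincent/Grenadines - St. Lucia 0:1 St. Kitts & Nevis - Dominica 1:0"

matches <- s |>
  str_split_1("\\d+:\\d+") |>            # split on the scores
  str_remove("\\d{2}/\\d{2}/\\d{4}") |>  # remove the date, if present
  str_trim()                             # strip surrounding whitespace

# Drop the empty element left over because the string ends with a score
matches <- matches[matches != ""]

# Split each pairing on the literal " - " and flatten to one character vector
teams <- unlist(str_split(matches, fixed(" - ")))
teams
#> [1] "Japan"                  "Ghana"                  "St. Vincent/Grenadines"
#> [4] "St. Lucia"              "St. Kitts & Nevis"      "Dominica"
```

Using fixed(" - ") treats the separator as a literal string rather than a regular expression, so characters such as '.', '&', and '/' inside team names are left untouched.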