R split string and keep section


I have a string containing the starting lineup for a rugby game (extracted from the web). It looks like this:

 "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"

What I want is essentially a table with two columns, one being the player's number, and the other being the player's name. e.g.

position     name
1            Joe Moody
2            Codie Taylor
3            Owen Franks
4            Scott Barrett
...          ...

For all players.

I've tried using strsplit to split on ",". The first problem is that the first player keeps the team label:

"Crusaders: 15 David Havili"

The second is that players 1 and 16 end up merged into one element, because there is no comma before "Replacements:":

"1 Joe MoodyReplacements: 16 Sam Anderson-Heather".

Any ideas?


Solution

  • Using stringr::str_match_all() and a regular expression, you can find and extract all matches. Take care to use the non-greedy quantifier (*?) so each match stops at the first comma, and to also match the end of the string, where there is no comma:

    library(dplyr)
    library(stringr)
    ea <- "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"
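    # Split starters from replacements so that "1 Joe Moody" ends its own string
    ea <- unlist(strsplit(ea, "Replacements: "))

    tibble(
      # Jersey numbers: every run of digits
      jersey = str_match_all(ea, "\\d+") %>% unlist(),
      # Names: text after "<digit><space>", lazily up to the next comma or the end of the string
      player = str_match_all(ea, "(?<=\\d\\s).*?(?=$|,)") %>% unlist()
    )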
    
    # A tibble: 23 x 2
       jersey player               
       <chr>  <chr>                
     1 15     David Havili         
     2 14     Seta Tamanivalu      
     3 13     Jack Goodhue         
     4 12     Ryan Crotty          
     5 11     George Bridge
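
For comparison, here is a minimal base R sketch of the same idea that returns the table sorted by position, as the question asks. This is not part of the answer above. The names lineup, cleaned, entries and players are made up for illustration, and the input is a shortened copy of the string from the question. It assumes the only fused label is "Replacements: " and that every entry has the form "<number> <name>".

```r
# Shortened sample of the string from the question (illustrative only)
lineup <- "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 5 Sam Whitelock (c), 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry"

# Turn the fused "Replacements: " label into an ordinary comma separator
cleaned <- sub("Replacements: ", ", ", lineup, fixed = TRUE)

# Drop the leading team label (everything up to and including the first ": ")
cleaned <- sub("^[^:]*: ", "", cleaned)

# Now every entry is comma-separated, so a plain split works
entries <- trimws(strsplit(cleaned, ",", fixed = TRUE)[[1]])

# Separate the leading number from the rest of each entry
players <- data.frame(
  position = as.integer(sub("^([0-9]+) .*$", "\\1", entries)),
  name     = sub("^[0-9]+ ", "", entries),
  stringsAsFactors = FALSE
)

# Sort by jersey number, as in the desired table
players <- players[order(players$position), ]
rownames(players) <- NULL

print(players)
```

Cleaning the separators first means a single strsplit() is enough and no lookaround is needed. Markers such as "(c)" stay in the name, and would need an extra sub() if they should be removed.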