Search code examples
rdataframedata-manipulationstrsplit

In R, convert/split 1-column dataframe into 4 columns based on splitting content in strings


This feels like a fairly difficult data manipulation / dataframe fixup issue in R. We have the following messy dataframe, currently organized such that multiple columns of information are packed into the X2 column. Using fake names, emails, phone numbers in the example below:

coach_info <- structure(list(X1 = c(NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_), X2 = c("TBA\r\n Head Coach", "Bobby Flowes\r\n Associate Head Women's Basketball Coach", 
"Jimmy Jimm\r\n Assistant Women's Basketball Coach", "Rod Barber\r\n Head Men's Basketball Coach\r\n       (123) [email protected]", 
NA, "Gabens Spar\r\n Men's Basketball Graduate Assistant [email protected]", 
"A.B. Better\r\n Head Women's Basketball Coach/Head Men's Golf Coach/Sports Information [email protected]\r\n   111-222-3333", 
"Nick Romanov\r\n Head Crew Coach\r\n [email protected]\r\n 123-123-1234", 
"Name Lasttt\r\n Assistant Coach")), row.names = c(1L, 2L, 3L, 
7L, 12L, 16L, 17L, 25L, 29L), class = "data.frame")

head(coach_info, 4)
    X1                                                                                   X2
1 <NA>                                                                   TBA\r\n Head Coach
2 <NA>                             Bobby Flowes\r\n Associate Head Women's Basketball Coach
3 <NA>                                    Jimmy Jimm\r\n Assistant Women's Basketball Coach
7 <NA> Rod Barber\r\n Head Men's Basketball Coach\r\n       (123) [email protected]

We're trying to split the X2 column info into 4 columns for Name, Title, Email and Phone. When we strsplit(coach_info$X2, '\r\n'), what we get is a messy nested list, and the splitting using \r\n is imperfect as the \r\n are missing in some of the rows:

enter image description here enter image description here enter image description here

In addition to this, the inner-nested lists each have differing numbers of elements, as many rows are missing 1 or more of the name, phone number or email address:

> unlist(lapply(strsplit(coach_info$X2, '\r\n'), length))
 [1] 2 2 2 3 1 2 3 4 2

Our goal is something as close as possible to this:

output_df <- data.frame(
    Name = c('TBA', 'Bobby Flowes', 'Jimmy Jimm', 'Rod Barber', NA, 'Gaben Spar', 'A.B. Better', 'Nick Romanov', 'Name Lasttt'),
    Title = c('Head Coach', "Associate Head Women's Basketball Coach", "Assistant Women's Basketball Coach", "Head Men's Basketball Coach",
              NA, " Men's Basketball Graduate Assistant", "Head Women's Basketball Coach/Head Men's Golf Coach/Sports Information Associate",
              "Head Crew Coach", "Assistant Coach"),
    Email = c(NA, NA, NA, "[email protected]", NA, "[email protected]", "[email protected]", "[email protected]", NA),
    Phone = c(NA, NA, NA, "(123) 456-7890", NA, NA, "111-222-3333", "123-123-1234", NA),
    stringsAsFactors = FALSE
  )
  

>   head(output_df, 4)
          Name                                   Title             Email          Phone
1          TBA                              Head Coach              <NA>           <NA>
2 Bobby Flowes Associate Head Women's Basketball Coach              <NA>           <NA>
3   Jimmy Jimm      Assistant Women's Basketball Coach              <NA>           <NA>
4   Rod Barber             Head Men's Basketball Coach [email protected] (123) 456-7890

It seems like it may be impossible to cleanly split up the strings where no spaces or \r\n exist between different fields, like in the screenshots above. We're just trying to get as close as possible at this point...


Solution

  • how about something like this

    require(data.table)
    setDT(coach_info)
    
    re.phone <- '.*(\\d{3}[^[:alnum:]]*\\d{3}[^[:alnum:]]*\\d{4}).*'
    re.email <- ".*[^_[:alnum:]\\-\\.]([_[:alnum:]\\-\\.]+@[[:alnum:]\\.]+).*"
    re.text1 <- '([[:alnum:][:blank:]]+)\r\n([[:alnum:][:blank:][:punct:]]+).*'
    
    
    coach_info[,processed:=X2]
    
    coach_info[grepl(re.phone,X2), phone:=gsub(re.phone,'\\1',X2)]
    coach_info[!is.na(phone), processed:=gsub(phone,' ',X2,fixed=T),by=phone]
    
    coach_info[grepl(re.email,processed), email:=gsub(re.email,'\\1',processed)]
    coach_info[!is.na(email), processed:=gsub(email,' ',processed,fixed=T),by=email]
    
    coach_info[, Name:=gsub(re.text1,'\\1',processed)]
    coach_info[, Title:=gsub(re.text1,'\\2',processed)]