This feels like a fairly difficult data manipulation / dataframe fixup issue in R. We have the following messy dataframe, currently organized such that multiple columns of information are packed into the X2
column. Using fake names, emails, phone numbers in the example below:
coach_info <- structure(list(X1 = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), X2 = c("TBA\r\n Head Coach", "Bobby Flowes\r\n Associate Head Women's Basketball Coach",
"Jimmy Jimm\r\n Assistant Women's Basketball Coach", "Rod Barber\r\n Head Men's Basketball Coach\r\n (123) [email protected]",
NA, "Gabens Spar\r\n Men's Basketball Graduate Assistant [email protected]",
"A.B. Better\r\n Head Women's Basketball Coach/Head Men's Golf Coach/Sports Information [email protected]\r\n 111-222-3333",
"Nick Romanov\r\n Head Crew Coach\r\n [email protected]\r\n 123-123-1234",
"Name Lasttt\r\n Assistant Coach")), row.names = c(1L, 2L, 3L,
7L, 12L, 16L, 17L, 25L, 29L), class = "data.frame")
head(coach_info, 4)
X1 X2
1 <NA> TBA\r\n Head Coach
2 <NA> Bobby Flowes\r\n Associate Head Women's Basketball Coach
3 <NA> Jimmy Jimm\r\n Assistant Women's Basketball Coach
7 <NA> Rod Barber\r\n Head Men's Basketball Coach\r\n (123) [email protected]
We're trying to split the X2
column info into 4 columns for Name
, Title
, Email
and Phone
. When we strsplit(coach_info$X2, '\r\n')
, what we get is a messy nested list, and the splitting using \r\n
is imperfect as the \r\n
are missing in some of the rows:
In addition to this, the inner-nested lists each have differing numbers of elements, as many rows are missing 1 or more of the name, phone number or email address:
> unlist(lapply(strsplit(coach_info$X2, '\r\n'), length))
[1] 2 2 2 3 1 2 3 4 2
Our goal is something as close as possible to this:
output_df <- data.frame(
Name = c('TBA', 'Bobby Flowes', 'Jimmy Jimm', 'Rod Barber', NA, 'Gaben Spar', 'A.B. Better', 'Nick Romanov', 'Name Lasttt'),
Title = c('Head Coach', "Associate Head Women's Basketball Coach", "Assistant Women's Basketball Coach", "Head Men's Basketball Coach",
NA, " Men's Basketball Graduate Assistant", "Head Women's Basketball Coach/Head Men's Golf Coach/Sports Information Associate",
"Head Crew Coach", "Assistant Coach"),
Email = c(NA, NA, NA, "[email protected]", NA, "[email protected]", "[email protected]", "[email protected]", NA),
Phone = c(NA, NA, NA, "(123) 456-7890", NA, NA, "111-222-3333", "123-123-1234", NA),
stringsAsFactors = FALSE
)
> head(output_df, 4)
Name Title Email Phone
1 TBA Head Coach <NA> <NA>
2 Bobby Flowes Associate Head Women's Basketball Coach <NA> <NA>
3 Jimmy Jimm Assistant Women's Basketball Coach <NA> <NA>
4 Rod Barber Head Men's Basketball Coach [email protected] (123) 456-7890
It seems like it may be impossible to cleanly split up the strings where no spaces or \r\n
exist between different fields, like in the screenshots above. We're just trying to get as close as possible at this point...
how about something like this
require(data.table)
setDT(coach_info)
re.phone <- '.*(\\d{3}[^[:alnum:]]*\\d{3}[^[:alnum:]]*\\d{4}).*'
re.email <- ".*[^_[:alnum:]\\-\\.]([_[:alnum:]\\-\\.]+@[[:alnum:]\\.]+).*"
re.text1 <- '([[:alnum:][:blank:]]+)\r\n([[:alnum:][:blank:][:punct:]]+).*'
coach_info[,processed:=X2]
coach_info[grepl(re.phone,X2), phone:=gsub(re.phone,'\\1',X2)]
coach_info[!is.na(phone), processed:=gsub(phone,' ',X2,fixed=T),by=phone]
coach_info[grepl(re.email,processed), email:=gsub(re.email,'\\1',processed)]
coach_info[!is.na(email), processed:=gsub(email,' ',processed,fixed=T),by=email]
coach_info[, Name:=gsub(re.text1,'\\1',processed)]
coach_info[, Title:=gsub(re.text1,'\\2',processed)]