
Separating Dataframe columns into more columns by various delimiters


I have a dataset, a sample of which is shown below via dput(). I'm having trouble splitting some of its columns by delimiter.

    > dput(head(team_data))
    structure(list(X1 = 2:6,
        names2 = c("Andre Callender  Seton Hall Preparatory School (West Orange, NJ)",
            "Gosder Cherilus  Somerville (Somerville, MA)",
            "Justin Bell  Mount Vernon (Alexandria, VA)",
            "Tom Anevski  Elder (Cincinnati, OH)",
            "Brad Mueller  Mars Area (Mars, PA)"),
        pos2 = c("RB 5-10 185", "OT 6-7 270", "TE 6-3 250", "OT 6-5 265", "CB 6-0 170"),
        rating2 = c("0.8667 194 18 8", "0.8667 262 20 1", "0.8333 306 14 7",
            "0.8333 377 25 13", "0.8333 496 36 16"),
        status2 = c("Enrolled   6/30/2003", "Enrolled   6/30/2003", "Enrolled   6/30/2003",
            "Enrolled   6/30/2003", "Enrolled   6/30/2003"),
        team = c("Boston-College", "Boston-College", "Boston-College",
            "Boston-College", "Boston-College"),
        year = c(2003L, 2003L, 2003L, 2003L, 2003L)),
        .Names = c("X1", "names2", "pos2", "rating2", "status2", "team", "year"),
        row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))

This is the code I am running on that dataset. The first two separate() calls run without error and, as far as I can tell, work as expected.

    library(rvest)
    library(stringr)
    library(tidyr)
    library(readxl)
    df2 <- separate(data = team_data, col = pos2, into = c("Position", "Height", "Weight"), sep = " ")
    df3 <- separate(data = df2, col = rating2, into = c("Rating", "National", "Position", "State Rank"), sep = " ")

After that, I run into trouble splitting the remaining columns. I have tried several variations (shown below), but every one of them produces the same error: "Error: Data source must be a dictionary".

    df4 <- separate(data = df3, col = names2, into = c("Name", "Geo"), sep = "(")
    df4 <- separate(data = df3, col = names2, into = c("Name", "Geo"), sep = '\\(|\\)')
    df4 <- separate(data = df3, col = status2, into = c("Date_Enrollment", "Enroll_Status"), sep = " ")
    df4 <- separate(data = df3, col = status2, into = c("Date_Enrollment", "Enroll_Status"), sep = "   ")

The ultimate goal would be to separate out the "names2" column at the "(" and the "," and remove the ")" so that I would end up with 3 columns of data. For the other column ("status2") the goal would be to separate out the "Enrolled" from the date of enrollment.

From what I have read, the error I'm getting indicates that I am duplicating column names, but I can't figure out where that is happening.


Solution

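  • You are using the column name Position twice: once when creating df2 (from pos2) and again when creating df3 (from rating2). That is the duplicated column name behind the error. Giving the second one a different name fixes it. This works for me: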

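    team_data %>%
      separate(col = pos2, into = c("Position", "Height", "Weight"), sep = " ") %>%
      # "Position2" avoids clashing with the "Position" column created above
      separate(col = rating2, into = c("Rating", "National", "Position2", "State Rank"), sep = " ") %>%
      # "(" is a regex metacharacter, so it has to be escaped
      separate(col = names2, into = c("Name", "Geo"), sep = "\\(") %>%
      # in status2 the status comes first ("Enrolled"), then the date
      separate(col = status2, into = c("Enroll_Status", "Date_Enrollment"), sep = "   ")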
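
The pipeline above splits names2 only at the "(", so Geo still holds the city, the state, and the closing ")" (for example "West Orange, NJ)"). The question's stated goal of three columns could be reached roughly as sketched below. This is not part of the original answer: it swaps in tidyr::extract() with a capture-group regex instead of separate(), the column names City and State are made up for illustration, and it assumes every names2 value has the form "text (city, ST)". It starts with a quick base R check of the `into` names from the question, which shows where the duplicate comes from.

```r
library(tidyr)

# All `into` names used by the question's first two separate() calls.
into_all <- c("Position", "Height", "Weight",
              "Rating", "National", "Position", "State Rank")
dup_names <- unique(into_all[duplicated(into_all)])
print(dup_names)  # names that appear more than once

# Two rows copied from the dput() output in the question.
team_data <- data.frame(
  names2 = c("Andre Callender  Seton Hall Preparatory School (West Orange, NJ)",
             "Brad Mueller  Mars Area (Mars, PA)"),
  status2 = c("Enrolled   6/30/2003", "Enrolled   6/30/2003"),
  stringsAsFactors = FALSE
)

# Assumption: names2 always looks like "text (city, ST)".
# Three capture groups: text before " (", the city, and the state.
# City and State are hypothetical column names.
step1 <- tidyr::extract(team_data, names2,
                        into = c("Name", "City", "State"),
                        regex = "^(.*) \\((.*), (.*)\\)$")

# Status and date are separated by a run of whitespace.
out <- separate(step1, status2,
                into = c("Enroll_Status", "Date_Enrollment"),
                sep = "\\s+")
print(out)
```

Using "\\s+" rather than a literal run of three spaces keeps the split working if the amount of whitespace varies between rows. The Name column here still contains the school, since the question only asks to split at the "(" and the ",".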