Tags: r, dataframe, dplyr, corruption, splitstackshape

Error running cSplit when splitstackshape/data.table and tidyr/dplyr are loaded


Example data file (csv format)

testdf <- read.csv("example.csv")

I am trying to automate some roster-mining. At one point I need to split rows based on names with separators, so cSplit from splitstackshape is perfect. I am also preceding and following the split with a bunch of dplyr data shaping.

Loaded libraries:

library(data.table)
library(splitstackshape)
library(tidyr)
library(dplyr)

The problem is that when I load dplyr after data.table, I get the following message:

Attaching package: ‘dplyr’

The following objects are masked from ‘package:data.table’:

    between, last

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Then when I try to use cSplit:

test <- cSplit(testdf, "Registrar", "/", direction = "long")

I get this error:

Error in `[.tbl_df`(indt, , splitCols, with = FALSE) : 
  unused argument (with = FALSE)

I have tried various permutations: the error only occurs when both data.table and dplyr are loaded (in either order). Restarting R without dplyr, or never loading it, makes cSplit work properly.

I need to be able to use both at the same time, though, and detaching dplyr doesn't help (it just throws missing-dplyr errors).

I have seen this thread, but the conclusion there seems to be that the data is corrupted. That seems plausible, because if I run cSplit on a toy data set,

Name <- "Bo / Ashley"
Date <- "2015-02-04"

testdf2 <- data.frame(Name, Date)

testtoy <- cSplit(testdf2, "Name", "/", direction = "long")

it works fine. But I have no idea how to fix this "corruption".


Solution

  • I haven't updated the functions in "splitstackshape" to work with tbl_df objects. As such, the current workaround is to add a data.frame conversion step to your chain.

    Compare:

    library(splitstackshape)
    library(dplyr)
    
    CT <- tbl_df(head(concat.test))
    
    CT %>% cSplit("Likes")
    # Error in `[.tbl_df`(indt, , splitCols, with = FALSE) : 
    #   unused argument (with = FALSE)
    
    CT %>% data.frame %>% cSplit("Likes")
    #      Name                   Siblings    Hates Likes_1 Likes_2 Likes_3 Likes_4 Likes_5
    # 1:   Boyd Reynolds , Albert , Ortega     2;4;       1       2       4       5       6
    # 2:  Rufus  Cohen , Bert , Montgomery 1;2;3;4;       1       2       4       5       6
    # 3:   Dana                     Pierce       2;       1       2       4       5       6
    # 4: Carole Colon , Michelle , Ballard     1;4;       1       2       4       5       6
    # 5: Ramona           Snyder , Joann ,   1;2;3;       1       2       5       6      NA
    # 6: Kelley          James , Roxanne ,     1;4;       1       2       5       6      NA
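
The same workaround can be applied to the roster case from the question. This is only a sketch: the `roster` data below is a made-up stand-in for example.csv (which is not reproduced here), and it assumes a dplyr version that exports `as_tibble()` for creating a tbl_df.

```r
library(splitstackshape)
library(dplyr)

# Hypothetical stand-in for the roster data; not the real example.csv.
roster <- data.frame(
  Registrar = c("Bo / Ashley", "Kim"),
  Date = c("2015-02-04", "2015-02-05"),
  stringsAsFactors = FALSE
)

# Simulate what earlier dplyr steps can leave behind: a tbl_df, not "corrupted" data.
roster_tbl <- as_tibble(roster)
class(roster_tbl)  # includes "tbl_df", which is the class named in the error

# Convert back to a plain data.frame immediately before cSplit.
split_roster <- roster_tbl %>%
  as.data.frame() %>%
  cSplit("Registrar", "/", direction = "long")

split_roster
```

Checking `class()` on the real object before calling cSplit is a quick way to tell whether this conversion step is needed.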
    

    Alternatively, since with = FALSE is a "data.table" argument, you can use tbl_dt objects instead of tbl_df objects:

    CT2 <- tbl_dt(head(concat.test))
    
    CT2 %>% cSplit("Likes")
    #      Name                   Siblings    Hates Likes_1 Likes_2 Likes_3 Likes_4 Likes_5
    # 1:   Boyd Reynolds , Albert , Ortega     2;4;       1       2       4       5       6
    # 2:  Rufus  Cohen , Bert , Montgomery 1;2;3;4;       1       2       4       5       6
    # 3:   Dana                     Pierce       2;       1       2       4       5       6
    # 4: Carole Colon , Michelle , Ballard     1;4;       1       2       4       5       6
    # 5: Ramona           Snyder , Joann ,   1;2;3;       1       2       5       6      NA
    # 6: Kelley          James , Roxanne ,     1;4;       1       2       5       6      NA
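
If `tbl_dt` is not available in your installed dplyr (an assumption about your setup, not something stated above), a similar effect can probably be had by converting to a data.table explicitly with `as.data.table()` from "data.table". The sketch below again uses made-up roster data and assumes `as_tibble()` is exported by dplyr.

```r
library(splitstackshape)
library(data.table)
library(dplyr)

# Hypothetical roster data wrapped as a tbl_df, as it might be after dplyr steps.
roster_tbl <- as_tibble(data.frame(
  Registrar = c("Bo / Ashley", "Kim"),
  Date = c("2015-02-04", "2015-02-05"),
  stringsAsFactors = FALSE
))

# as.data.table() returns a copy whose class is data.table/data.frame,
# so the `[` call with `with = FALSE` is handled by data.table.
roster_dt <- as.data.table(roster_tbl)
class(roster_dt)

split_dt <- cSplit(roster_dt, "Registrar", "/", direction = "long")
split_dt
```

The result is a data.table, so any dplyr steps that follow will operate on that class rather than on a tbl_df.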
    

    Of course, if someone creates a pull request that solves the issue, I would be more than happy to make the relevant updates :-)