Search code examples
rdcast

dcast fails to cast character column when the data size is large


I'm using the dcast function in the library(reshape2) package to cast a simple table of three columns

df = data.table(id  = 1:1e6, 
             var = c('continent','subcontinent',...), 
             val = c('America','Caribbean',...)````

by dcast(df, id ~ var, value.var ='val') and it automatically converts the value to the count, i.e.

id     continent   subcontinent
 1     1           1
 2     1           1

However, if I reduce the size to 10000 rows, it correctly outputs

id     continent   subcontinent
 1     America     Caribbean
 2     Europe      West Europe

Is this a bug or I need to change the code somehow? Please help. Thanks!


Solution

  • The problem is not the size of the dataset itself but the occurrence of duplicate entries in the full dataset. By picking smaller subsets from the full dataset there is a chance that no duplicates are included.

    help("dcast", "data.table") says:

    When variable combinations in formula doesn't identify a unique value in a cell, fun.aggregate will have to be specified, which defaults to length if unspecified.

    How to find duplicates in the full dataset

    All occurrences of duplicates can be identified by

    cols <- c("id", "var")
    df[duplicated(df, by = cols) | duplicated(df, by = cols, fromLast = TRUE)][
      order(id)]
    
       id          var           val
    1:  1 subcontinent     Caribbean
    2:  1 subcontinent South America
    

    Note that we are looking for duplicates in id and var as these two form the cells, i.e., rows and columns, of the reshaped result.

    Why unique() doesn't work

    NB: This is the explanation why simply taking unique(df) will not work:

    unique(df)
    
       id          var           val
    1:  1    continent       America
    2:  1 subcontinent     Caribbean
    3:  2    continent        Europe
    4:  2 subcontinent   West Europe
    5:  1 subcontinent South America
    

    does not remove any rows. Consequently,

    dcast(unique(df), id ~ var, value.var = "val")
    
    Aggregate function missing, defaulting to 'length'
       id continent subcontinent
    1:  1         1            2
    2:  2         1            1
    

    Whereas

    unique(df, by = cols)
    
       id          var         val
    1:  1    continent     America
    2:  1 subcontinent   Caribbean
    3:  2    continent      Europe
    4:  2 subcontinent West Europe
    

    has removed the duplicate var for id == 1L. Consequently,

    dcast(unique(df, by = cols), id ~ var, value.var = "val")
    
       id continent subcontinent
    1:  1   America    Caribbean
    2:  2    Europe  West Europe
    

    How to find the row numbers of duplicated rows

    The OP has reported that the issue appears only with the full dataset but not with a subset of the first 1e5 rows.

    The row indices of the duplicate entries can be found by

    which(duplicated(df, by = cols))
    

    which returns 5 for the sample dataset. For OP's full dataset, I suspect that

    min(which(duplicated(df, by = cols))) > 1e5
    

    is true, i.e., there are no duplicates within the first 1e5 rows.

    How to create character columns even in case of duplicate entries

    OP's own approach using fun.aggregate = function(x) paste(x[1L]) as well as applying unique() on df just aim at removing any disturbing duplicates. The duplicates will be silently dropped.

    Alternatively, toString() can be used as aggregation function which shows the duplicate entries:

    dcast(df, id ~ var, toString, value.var = "val")
    
       id continent             subcontinent
    1:  1   America Caribbean, South America
    2:  2    Europe              West Europe
    

    Data

    library(data.table)
    df <- data.table(
      id  = c(1L, 1L, 2L, 2L, 1L),
      var = c("continent", "subcontinent", "continent", "subcontinent", "subcontinent"),
      val = c("America", "Caribbean", "Europe", "West Europe", "South America")
    )
    
    df
    
       id          var           val
    1:  1    continent       America
    2:  1 subcontinent     Caribbean
    3:  2    continent        Europe
    4:  2 subcontinent   West Europe
    5:  1 subcontinent South America