dcast fails to cast character column when the data size is large

I'm using the dcast function in the library(reshape2) package to cast a simple table of three columns

df = data.table(id  = 1:1e6, 
             var = c('continent','subcontinent',...), 
             val = c('America','Caribbean',...)````

by dcast(df, id ~ var, value.var ='val') and it automatically converts the value to the count, i.e.

id     continent   subcontinent
 1     1           1
 2     1           1

However, if I reduce the size to 10000 rows, it correctly outputs

id     continent   subcontinent
 1     America     Caribbean
 2     Europe      West Europe

Is this a bug or I need to change the code somehow? Please help. Thanks!

Solution

The problem is not the size of the dataset itself but the occurrence of duplicate entries in the full dataset. By picking smaller subsets from the full dataset there is a chance that no duplicates are included.

help("dcast", "data.table") says:

When variable combinations in formula doesn't identify a unique value in a cell, fun.aggregate will have to be specified, which defaults to length if unspecified.

How to find duplicates in the full dataset

All occurrences of duplicates can be identified by

cols <- c("id", "var")
df[duplicated(df, by = cols) | duplicated(df, by = cols, fromLast = TRUE)][
  order(id)]

   id          var           val
1:  1 subcontinent     Caribbean
2:  1 subcontinent South America

Note that we are looking for duplicates in id and var as these two form the cells, i.e., rows and columns, of the reshaped result.

Why `unique()` doesn't work

NB: This is the explanation why simply taking unique(df) will not work:

unique(df)

   id          var           val
1:  1    continent       America
2:  1 subcontinent     Caribbean
3:  2    continent        Europe
4:  2 subcontinent   West Europe
5:  1 subcontinent South America

does not remove any rows. Consequently,

dcast(unique(df), id ~ var, value.var = "val")

Aggregate function missing, defaulting to 'length'
   id continent subcontinent
1:  1         1            2
2:  2         1            1

Whereas

unique(df, by = cols)

   id          var         val
1:  1    continent     America
2:  1 subcontinent   Caribbean
3:  2    continent      Europe
4:  2 subcontinent West Europe

has removed the duplicate var for id == 1L. Consequently,

dcast(unique(df, by = cols), id ~ var, value.var = "val")

   id continent subcontinent
1:  1   America    Caribbean
2:  2    Europe  West Europe

How to find the row numbers of duplicated rows

The OP has reported that the issue appears only with the full dataset but not with a subset of the first 1e5 rows.

The row indices of the duplicate entries can be found by

which(duplicated(df, by = cols))

which returns 5 for the sample dataset. For OP's full dataset, I suspect that

min(which(duplicated(df, by = cols))) > 1e5

is true, i.e., there are no duplicates within the first 1e5 rows.

How to create character columns even in case of duplicate entries

OP's own approach using fun.aggregate = function(x) paste(x[1L]) as well as applying unique() on df just aim at removing any disturbing duplicates. The duplicates will be silently dropped.

Alternatively, toString() can be used as aggregation function which shows the duplicate entries:

dcast(df, id ~ var, toString, value.var = "val")

   id continent             subcontinent
1:  1   America Caribbean, South America
2:  2    Europe              West Europe

Data

library(data.table)
df <- data.table(
  id  = c(1L, 1L, 2L, 2L, 1L),
  var = c("continent", "subcontinent", "continent", "subcontinent", "subcontinent"),
  val = c("America", "Caribbean", "Europe", "West Europe", "South America")
)

df