r memory text-processing categorical-data logfile-analysis

R split() function size increase issue

I have the following data set:

> head(data)
  X    UserID NPS V3 V4 V5                                   Event              V7          Element                            ElementValue 
1 1 254727216  10  0 19 10 nps.agent.14b.no other attempt was made 10/4/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
2 2 298379949   0  0 28 11 nps.agent.14b.no other attempt was made 9/30/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
3 3 254710917   0  0 20 12 nps.agent.14b.no other attempt was made 9/15/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
4 4 238919392   7  0 17  9 nps.agent.14b.no other attempt was made 9/17/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
5 5 144693025  10  0 18 10 nps.agent.14b.no other attempt was made 9/17/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
6 6 249978568   5  0 21 12 nps.agent.14b.no other attempt was made 9/18/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made

When I split the data set as:

data_splitted <- split(data,data$UserID)

The problem is a huge increase in size, which exceeds my RAM when I try this with the whole data set instead of this sample:

> format(object.size(data),units="Mb")
[1] "0.2 Mb"
> format(object.size(data_splitted),units="Mb")
[1] "45.7 Mb"

Any insights into why this is happening, and any way to tackle it, would be appreciated.

Solution

Try this:

data$UserID <- as.character(data$UserID)
data_splitted <- split(data,data$UserID)

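What happens here is not positional indexing: split() converts the grouping variable to a factor and returns one list element per distinct UserID, so no empty elements are created between IDs. The growth comes from what each element carries. Every piece is a complete data frame with its own names, row names and class attributes, and any factor column inside it keeps the full set of levels from the original data. With thousands of one-row pieces those repeated levels add up quickly, and object.size() counts them once per piece. If UserID is stored as a factor, converting it to character removes one such set of levels from every piece; the same applies to the other text columns (Event, V7, Element, ElementValue) if they are factors.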
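To see what each piece carries, it can help to run the comparison on a small toy example first. The sketch below is not the asker's data: `toy_factor`, `toy_char` and the generated IDs are made up for illustration, and it assumes the ID column is stored as a factor, which the question does not confirm (check with str(data)).

```r
# Hypothetical toy data: 500 users, one row each.
n <- 500
ids <- 100000000L + seq_len(n) * 7919L

# Same content twice: ID as a factor vs. ID as plain character.
toy_factor <- data.frame(UserID = factor(ids), NPS = rep(0:9, length.out = n))
toy_char   <- data.frame(UserID = as.character(ids),
                         NPS = rep(0:9, length.out = n),
                         stringsAsFactors = FALSE)

pieces_factor <- split(toy_factor, toy_factor$UserID)
pieces_char   <- split(toy_char, toy_char$UserID)

# Number of list elements: one per distinct ID in this toy data.
length(pieces_factor)

# Each one-row piece of the factor version still holds every level.
nlevels(pieces_factor[[1]]$UserID)

# Reported footprint of the two versions.
format(object.size(pieces_factor), units = "Mb")
format(object.size(pieces_char), units = "Mb")
```

Note that object.size() is documented as a rough estimate that does not account for character vectors shared within one object, so the reported figure for the factor version may overstate what is actually allocated.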

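Another way, which leaves the ID variable intact inside the one-row data frames, is to build the list in a loop (this assumes each UserID appears only once; a repeated ID would overwrite the earlier row):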

data_splitted <- list()
for (i in seq_len(nrow(data))) {
  # index by name (character), not by number, so each ID is one named element
  data_splitted[[as.character(data$UserID[i])]] <- data[i, ]
}
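The as.character() call in the loop matters, because assigning into a list with a numeric index is positional. A minimal sketch with made-up values (`by_position` and `by_name` are illustrative names only):

```r
# Numeric index: R treats 5 as a position and pads positions 1-4 with NULL.
by_position <- list()
by_position[[5]] <- "row for user 5"
length(by_position)   # 5

# Character index: R treats "5" as a name and creates a single element.
by_name <- list()
by_name[["5"]] <- "row for user 5"
length(by_name)       # 1
names(by_name)        # "5"
```

With nine-digit IDs, the numeric form would try to create a list hundreds of millions of elements long, so the character form is the one to use.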

To access the elements in the newly created list, quote the ID, because a bare number is not a valid name. This works with both the $ operator and [[ ]]:

data_splitted$"144693025"
data_splitted[["144693025"]]

Another option would be to add characters in front of the numerical id. For instance:

data$UserID <- paste0("id",data$UserID)
data_splitted <- split(data,data$UserID)

This makes accessing list items more convenient:

data_splitted$id144693025
data_splitted$id238919392
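If memory is still tight on the full data set, one further option (a suggestion that goes beyond the approaches above, not something tested on the asker's data) is to split only the row numbers and pull rows out of the original data frame on demand. The names `toy`, `rows_by_user` and `user_rows` are hypothetical:

```r
# Hypothetical toy data with a repeated user.
toy <- data.frame(UserID = c(144693025L, 238919392L, 144693025L),
                  NPS = c(10L, 7L, 3L))

# Split the row indices, not the data frame: the list holds only integers.
rows_by_user <- split(seq_len(nrow(toy)), toy$UserID)

# Look up one user's rows when needed.
user_rows <- toy[rows_by_user[["144693025"]], ]
nrow(user_rows)   # 2
```

This keeps a single copy of the data and a small index list, at the cost of one subsetting step per lookup.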