
Memory size error when creating a unique ID from 3 columns


I want to create a unique ID in R based on three columns (serial, pnum and daynum) so that together they form a unique person-day ID.

I am using a large dataset, and do.call(interaction, df1) produces an error: cannot allocate vector of size 11.1 Gb.

serial         pnum daynum
11011202        1   1
11011202        1   2
11011202        4   1
11011202        4   2
11011203        1   1
11011203        1   2
11011207        1   1
11011207        1   2
11011207        2   1
11011207        2   2
11011209        1   1
11011209        1   2
11011209        2   1
11011209        2   2
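
For reference, a minimal sketch of the call mentioned above, run against this small example (df1 as built in the Data section at the end of the answer); at this size it completes without any problem:

    ids <- do.call(interaction, df1)
    head(ids)
    # By default interaction() creates a factor level for every possible
    # combination of its arguments (drop = FALSE), which is likely what
    # exhausts memory on the full dataset.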

Any suggestions, please?


Solution

  • Maybe a hash function is what you are after.
    The code below uses the package hashFunction. It provides three different hash functions; I have tested with murmur3.32, which produces 32-bit hashes.

    First, an example of usage with the data in the question.

    library(hashFunction)
    
    # hash each row: paste the three columns into one string and hash it to a 32-bit integer
    apply(df1, 1, function(x) murmur3.32(paste(x, collapse = "")))
    

    Now a larger dataset.

    # build a larger data set: 1,000,000 serials, each with 2 persons x 2 days
    serial <- rep(11011200 + 1:1000000, each = 4)
    n <- length(serial)
    pnum <- rep(rep(1:2, each = 2), length.out = n)
    daynum <- rep(1:2, length.out = n)
    
    df2 <- data.frame(serial, pnum, daynum)
    sum(duplicated(df2))   # confirm that every row is unique
    #[1] 0
    

    Tests with the larger df2. Row access is faster from a matrix than from a data frame, so I coerce df2 to a matrix.

    system.time({
      h <- apply(as.matrix(df2), 1, function(x) murmur3.32(paste(x, collapse = "")))
    })
    #     user    system   elapsed
    #   74.199     0.059    74.289
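
    As a rough illustration of why the matrix coercion helps (a small sketch; exact timings depend on the machine), compare extracting single rows from df2 with extracting them from a matrix copy:

    m <- as.matrix(df2)
    system.time(for (i in 1:10000) df2[i, ])   # row extraction from the data frame
    system.time(for (i in 1:10000) m[i, ])     # row extraction from the matrix
    rm(m)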
    

    Now try pre-allocating the result vector and assigning the values in a for loop.

    system.time({
      h2 <- integer(n)         # pre-allocate the result vector
      tmp <- as.matrix(df2)    # coerce to matrix once, outside the loop
      for(i in seq_len(n))
        h2[i] <- murmur3.32(paste(tmp[i, ], collapse = ""))
      rm(tmp)
    })
    #     user    system   elapsed
    #   67.321     0.045    67.406 
    
    identical(h, h2)
    #[1] TRUE
    
    object.size(df2)
    #64000984 bytes
    
    object.size(h)
    #16000048 bytes
    

    The hash vector is a quarter of the size of the data frame.
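
    One caveat: murmur3.32 produces 32-bit hashes, so collisions become possible once the row count reaches the millions. A quick sanity check before treating the hashes as unique IDs could be:

    sum(duplicated(h))
    # 0 means every person-day row received a distinct hash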

    Data.

    df1 <- read.table(text = "
    serial         pnum daynum
    11011202        1   1
    11011202        1   2
    11011202        4   1
    11011202        4   2
    11011203        1   1
    11011203        1   2
    11011207        1   1
    11011207        1   2
    11011207        2   1
    11011207        2   2
    11011209        1   1
    11011209        1   2
    11011209        2   1
    11011209        2   2                  
    ", header = TRUE)