
Memory size error when creating a unique ID from 3 columns


I want to create a unique ID in R based on three columns (serial, pnum and daynum) so that together they form a unique person-day ID.

I am using a large dataset, and do.call(interaction, df1) produces an error: cannot allocate vector of size 11.1 Gb.

serial         pnum daynum
11011202        1   1
11011202        1   2
11011202        4   1
11011202        4   2
11011203        1   1
11011203        1   2
11011207        1   1
11011207        1   2
11011207        2   1
11011207        2   2
11011209        1   1
11011209        1   2
11011209        2   1
11011209        2   2
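
For reference, a minimal sketch of the call mentioned above, run against this small example (df1 as built in the Data section at the end of the answer); at this size it completes without any problem:

    ids <- do.call(interaction, df1)
    head(ids)
    # By default interaction() creates a factor level for every possible
    # combination of its arguments (drop = FALSE), which is likely what
    # exhausts memory on the full dataset.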

Any suggestions, please?


Solution

  • Maybe a hash function is what you are after.
    The code below uses the package hashFunction. It provides three different hash functions; I have tested with murmur3.32, which produces 32-bit hashes.

    First, an example of usage with the data in the question.

    library(hashFunction)
    
    # hash each row: paste the three columns into one string and hash it to a 32-bit integer
    apply(df1, 1, function(x) murmur3.32(paste(x, collapse = "")))
    

    Now a larger dataset.

    # build a larger data set: 1,000,000 serials, each with 2 persons x 2 days
    serial <- rep(11011200 + 1:1000000, each = 4)
    n <- length(serial)
    pnum <- rep(rep(1:2, each = 2), length.out = n)
    daynum <- rep(1:2, length.out = n)
    
    df2 <- data.frame(serial, pnum, daynum)
    sum(duplicated(df2))   # confirm that every row is unique
    #[1] 0
    

    Tests with the larger df2. Row access is faster from a matrix than from a data frame, so I coerce df2 to a matrix.

    system.time({
      h <- apply(as.matrix(df2), 1, function(x) murmur3.32(paste(x, collapse = "")))
    })
    #     user    system   elapsed
    #   74.199     0.059    74.289
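
    As a rough illustration of why the matrix coercion helps (a small sketch; exact timings depend on the machine), compare extracting single rows from df2 with extracting them from a matrix copy:

    m <- as.matrix(df2)
    system.time(for (i in 1:10000) df2[i, ])   # row extraction from the data frame
    system.time(for (i in 1:10000) m[i, ])     # row extraction from the matrix
    rm(m)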
    

    Now try pre-allocating the result vector and assigning the values in a for loop.

    system.time({
      h2 <- integer(n)         # pre-allocate the result vector
      tmp <- as.matrix(df2)    # coerce to matrix once, outside the loop
      for(i in seq_len(n))
        h2[i] <- murmur3.32(paste(tmp[i, ], collapse = ""))
      rm(tmp)
    })
    #     user    system   elapsed
    #   67.321     0.045    67.406 
    
    identical(h, h2)
    #[1] TRUE
    
    object.size(df2)
    #64000984 bytes
    
    object.size(h)
    #16000048 bytes
    

    The hash vector is a quarter of the size of the data frame.
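
    One caveat: murmur3.32 produces 32-bit hashes, so collisions become possible once the row count reaches the millions. A quick sanity check before treating the hashes as unique IDs could be:

    sum(duplicated(h))
    # 0 means every person-day row received a distinct hash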

    Data.

    df1 <- read.table(text = "
    serial         pnum daynum
    11011202        1   1
    11011202        1   2
    11011202        4   1
    11011202        4   2
    11011203        1   1
    11011203        1   2
    11011207        1   1
    11011207        1   2
    11011207        2   1
    11011207        2   2
    11011209        1   1
    11011209        1   2
    11011209        2   1
    11011209        2   2                  
    ", header = TRUE)