Memory size error when creating unique id using 3 columns

I want to create a unique ID in R based on three columns, serial, pnum and daynum, so that together they form a unique person-day ID.

I am using a large dataset, df1, and my attempt produces an error: cannot allocate vector of size 11.1 Gb.

serial         pnum daynum
11011202        1   1
11011202        1   2
11011202        4   1
11011202        4   2
11011203        1   1
11011203        1   2
11011207        1   1
11011207        1   2
11011207        2   1
11011207        2   2
11011209        1   1
11011209        1   2
11011209        2   1
11011209        2   2

Any suggestions, please?


  • Maybe a hash function is what you are after.
    The code below uses the hashFunction package. It provides three hash functions; I tested murmur3.32, which produces 32-bit hashes.
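
    A 32-bit hash has only 2^32 possible values, so it may be worth estimating how likely collisions are before relying on it as a unique ID. The sketch below is a back-of-the-envelope birthday estimate, assuming an ideal, uniformly distributed hash. The row count n_rows is a hypothetical figure chosen for illustration, not one taken from the question.

    ```r
    # Hypothetical number of rows to be hashed (an assumption for illustration)
    n_rows <- 4e6
    # Number of distinct values a 32-bit hash can take
    n_values <- 2^32
    # Expected number of colliding pairs under a uniform-hash assumption:
    # (number of row pairs) / (number of possible hash values)
    expected_pairs <- n_rows * (n_rows - 1) / 2 / n_values
    expected_pairs  # roughly 1863 for these inputs
    ```

    If the estimate is well above zero, checking the hashed vector with anyDuplicated() is a cheap safeguard.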

    First, an example with the data in the question.

    library(hashFunction)
    apply(df1, 1, function(x) murmur3.32(paste(x, collapse = "")))
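
    One thing to watch when pasting the columns together with collapse = "": two different rows can produce the same string, and therefore the same hash. The sketch below shows this in base R with two hypothetical triples (not rows from the question) and how a separator avoids it.

    ```r
    # Two distinct hypothetical (serial, pnum, daynum) triples
    a <- c(1, 12, 3)
    b <- c(11, 2, 3)

    # Without a separator both collapse to the same string "1123"
    plain_a <- paste(a, collapse = "")
    plain_b <- paste(b, collapse = "")
    plain_a == plain_b

    # With a separator the strings stay distinct: "1_12_3" vs "11_2_3"
    sep_a <- paste(a, collapse = "_")
    sep_b <- paste(b, collapse = "_")
    sep_a == sep_b
    ```

    Whether this matters depends on the data: if serial always has the same number of digits and pnum and daynum are single digits, the plain concatenation is already unambiguous.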

    Now a larger dataset.

    serial <- rep(11011200 + 1:1000000, each = 4)
    n <- length(serial)
    pnum <- rep(rep(1:2, each = 2), length.out = n)
    daynum <- rep(1:2, length.out = n)
    df2 <- data.frame(serial, pnum, daynum)
    #[1] 0

    Tests with the larger df2. Matrix access is faster than data frame access, so I coerce df2 to a matrix.

    system.time(
      h <- apply(as.matrix(df2), 1, function(x) murmur3.32(paste(x, collapse = "")))
    )
    #     user    system   elapsed
    #   74.199     0.059    74.289

    Now preallocate the result vector and fill it in a for loop.

    system.time({
      h2 <- integer(n)        # preallocate the result
      tmp <- as.matrix(df2)
      for(i in seq_len(n)) 
        h2[i] <- murmur3.32(paste(tmp[i, ], collapse = ""))
    })
    #     user    system   elapsed
    #   67.321     0.045    67.406 
    identical(h, h2)
    #[1] TRUE
    object.size(df2)
    #64000984 bytes
    object.size(h)
    #16000048 bytes

    The hash vector is 4 times smaller than the data frame (about 16 MB versus 64 MB).
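
    If a guaranteed collision-free ID is needed, a different technique from hashing may be worth considering: packing the three columns into a single number arithmetically. The following is a minimal base R sketch, assuming pnum and daynum are non-negative integers and the result stays below 2^53 so that doubles represent it exactly. The toy df1 copies the first rows of the question's table; p_base, d_base and id are names introduced here for illustration.

    ```r
    # Toy data copied from the first rows of the table in the question
    df1 <- data.frame(
      serial = c(11011202, 11011202, 11011202, 11011202, 11011203, 11011203),
      pnum   = c(1, 1, 4, 4, 1, 1),
      daynum = c(1, 2, 1, 2, 1, 2)
    )

    # Bases one larger than the column maxima, so each column gets its own "digit"
    p_base <- max(df1$pnum) + 1
    d_base <- max(df1$daynum) + 1

    # Vectorised mixed-radix encoding: no strings and no row-wise loop
    df1$id <- (df1$serial * p_base + df1$pnum) * d_base + df1$daynum

    # 0 means no duplicated IDs
    anyDuplicated(df1$id)
    ```

    The result is one numeric vector of 8 bytes per row, and distinct (serial, pnum, daynum) triples always map to distinct IDs under the stated assumptions.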


