I want to create a unique ID in R based on three columns (serial, pnum and daynum) so that each combination gives a unique person-day ID.
I am using a large dataset, and do.call(interaction, df1) produces an error: cannot allocate vector of size 11.1 Gb.
serial pnum daynum
11011202 1 1
11011202 1 2
11011202 4 1
11011202 4 2
11011203 1 1
11011203 1 2
11011207 1 1
11011207 1 2
11011207 2 1
11011207 2 2
11011209 1 1
11011209 1 2
11011209 2 1
11011209 2 2
Any suggestions, please?
Maybe a hash function is what you are after.
The code below uses the package hashFunction. It has 3 different hash functions; I have tested with murmur3.32, which produces 32-bit hashes.
First, an example with the data in the question.
library(hashFunction)
apply(df1, 1, function(x) murmur3.32(paste(x, collapse = "")))
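One thing to watch when pasting with collapse = "": different rows can produce the same string before any hashing happens. The sketch below uses two made-up rows (not from the question's data) to show the effect, and how a separator avoids it. The separator "_" is just an assumption; any character that cannot occur in the columns would do.

```r
# Two different (serial, pnum, daynum) rows, invented for illustration
row_a <- c(1101120, 21, 1)
row_b <- c(11011202, 1, 1)

# Without a separator both rows collapse to the same string
key_a_plain <- paste(row_a, collapse = "")
key_b_plain <- paste(row_b, collapse = "")
identical(key_a_plain, key_b_plain)

# With a separator the column boundaries are kept, so the keys differ
key_a_sep <- paste(row_a, collapse = "_")
key_b_sep <- paste(row_b, collapse = "_")
identical(key_a_sep, key_b_sep)
```

If pnum and daynum are always single digits, as in the sample data, this ambiguity cannot arise, so it only matters for wider value ranges.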
Now a larger dataset.
# 1,000,000 serials with 4 rows each (2 persons x 2 days)
serial <- rep(11011200 + 1:1000000, each = 4)
n <- length(serial)
pnum <- rep(rep(1:2, each = 2), length.out = n)
daynum <- rep(1:2, length.out = n)
df2 <- data.frame(serial, pnum, daynum)
sum(duplicated(df2))
#[1] 0
Tests with the larger df2. Matrix access is faster than data frame access, so I coerce df2 to a matrix.
system.time({
h <- apply(as.matrix(df2), 1, function(x) murmur3.32(paste(x, collapse = "")))
})
# user system elapsed
# 74.199 0.059 74.289
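The row-wise paste inside the call above can also be done in one vectorised step with base R. The sketch below uses a small stand-in data frame (the name df and its values are only for illustration) to check that both ways give the same strings. Whether this shortens the total run time depends on how much of it is spent in the hash itself, which is not measured here.

```r
# Small stand-in for the data frame in the question
df <- data.frame(
  serial = c(11011202, 11011202, 11011203),
  pnum   = c(1, 4, 1),
  daynum = c(1, 2, 1)
)

# Row-by-row pasting, as in the apply() call
keys_apply <- apply(as.matrix(df), 1, function(x) paste(x, collapse = ""))

# The same strings built in a single vectorised call
keys_vec <- do.call(paste, c(df, sep = ""))

identical(unname(keys_apply), keys_vec)
```

The hash function would then be applied to each element of keys_vec instead of to each matrix row.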
Now try to reserve memory first and assign the values in a for loop.
system.time({
  h2 <- integer(n)        # preallocate the result vector
  tmp <- as.matrix(df2)
  for(i in seq_len(n))
    h2[i] <- murmur3.32(paste(tmp[i, ], collapse = ""))
  rm(tmp)                 # drop the temporary matrix
})
# user system elapsed
# 67.321 0.045 67.406
identical(h, h2)
#[1] TRUE
object.size(df2)
#64000984 bytes
object.size(h)
#16000048 bytes
The hash vector is 4 times smaller than the data frame (about 16 MB versus 64 MB).
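Since the goal is a unique ID, it may be worth estimating how likely two different rows are to share a 32-bit hash. The sketch below is a back-of-the-envelope birthday estimate and assumes the hash values are spread uniformly over all 2^32 possibilities, which is an idealisation and not something tested here.

```r
n <- 4e6          # number of rows in df2
space <- 2^32     # number of distinct 32-bit hash values

# Expected number of colliding pairs under the uniform assumption
expected_pairs <- n * (n - 1) / 2 / space
expected_pairs

# On the real hashes, anyDuplicated(h) returns 0 only if all values are distinct
```

If that check finds duplicates, the hash alone is not enough as an ID and a collision-free key would be needed instead.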
Data.
df1 <- read.table(text = "
serial pnum daynum
11011202 1 1
11011202 1 2
11011202 4 1
11011202 4 2
11011203 1 1
11011203 1 2
11011207 1 1
11011207 1 2
11011207 2 1
11011207 2 2
11011209 1 1
11011209 1 2
11011209 2 1
11011209 2 2
", header = TRUE)