Consider the following mock data:
df <- data.frame(a=c("John", "Susan", "Eric", "John", "Susan"),
b=c("K", NA, "J", "K", "S"),
c=c("Smith", "Johnson", "May", "Smith", "Johnson"))
df$a <- as.character(df$a)
df$b <- as.character(df$b)
df$c <- as.character(df$c)
That looks like this:
> df
a b c
1 John K Smith
2 Susan <NA> Johnson
3 Eric J May
4 John K Smith
5 Susan S Johnson
I generate a column called unique
that holds a unique number on the interaction of the three character variables.
I use an ifelse
statement to only interact column a
and c
if column b
is NA
.
df$unique <- NA
df$unique <- ifelse(is.na(df$b),
as.integer(interaction(df$a, df$c)),
as.integer(interaction(df$a, df$b, df$c)))
This results in:
> df
a b c unique
1 John K Smith 23
2 Susan <NA> Johnson 3
3 Eric J May 10
4 John K Smith 23
5 Susan S Johnson 9
When I use this code to construct the unique
variable in my real data containing several millions of rows, this computation runs for 21 hours.
Are there ways to speed up this performance? Any smarter solutions?
Is the ifelse
statement the bottleneck here?
would this work ?
library(data.table)
dt1 <- as.data.table(df)
dt1[, unique := .GRP, by = names(dt1)]
a b c unique
1: John K Smith 1
2: Susan NA Johnson 2
3: Eric J May 3
4: John K Smith 1
5: Susan S Johnson 4