Search code examples
rperformancedataframecharacterinteraction

R - Faster solutions for base::interaction with character variables?


Consider the following mock data:

df <- data.frame(a=c("John", "Susan", "Eric", "John", "Susan"), 
                 b=c("K", NA, "J", "K", "S"), 
                 c=c("Smith", "Johnson", "May", "Smith", "Johnson"))
df$a <- as.character(df$a)
df$b <- as.character(df$b)
df$c <- as.character(df$c)

That looks like this:

> df
      a    b       c 
1  John    K   Smith 
2 Susan <NA> Johnson 
3  Eric    J     May 
4  John    K   Smith  
5 Susan    S Johnson  

I generate a column called unique that holds a unique number on the interaction of the three character variables.

I use an ifelse statement to only interact column a and c if column b is NA.

df$unique <- NA
df$unique <- ifelse(is.na(df$b), 
             as.integer(interaction(df$a, df$c)),
             as.integer(interaction(df$a, df$b, df$c)))

This results in:

> df
      a    b       c unique
1  John    K   Smith     23
2 Susan <NA> Johnson      3
3  Eric    J     May     10
4  John    K   Smith     23
5 Susan    S Johnson      9

When I use this code to construct the unique variable in my real data containing several millions of rows, this computation runs for 21 hours.

Are there ways to speed up this performance? Any smarter solutions?

Is the ifelse statement the bottleneck here?


Solution

  • would this work ?

    library(data.table)
    dt1 <- as.data.table(df)
    dt1[, unique := .GRP, by = names(dt1)]
    
           a  b       c unique
    1:  John  K   Smith      1
    2: Susan NA Johnson      2
    3:  Eric  J     May      3
    4:  John  K   Smith      1
    5: Susan  S Johnson      4