r dataframe data-manipulation one-hot-encoding dummy-variable

Split variable into multiple factor variables


I have a dataset similar to this:

df <- data.frame(n = seq_len(1000000), x = sample(LETTERS, 1000000, replace = TRUE))

I'm looking for guidance on how to split variable x into multiple indicator variables, one per category, each taking the value 0 or 1.

In the end it would look like this:

n x A B C D E F G H . . .
1 D 0 0 0 1 0 0 0 0 . . .
2 B 0 1 0 0 0 0 0 0 . . .
3 F 0 0 0 0 0 1 0 0 . . .

In my dataset, variable x contains many more codes, so adding each new variable manually would be too time-consuming.

I was thinking about sorting the codes in variable x, assigning each one a unique number, and then writing a loop that creates a new variable for each code. But I feel like I'm overcomplicating things.


Solution

  • A fast and easy way is to use fastDummies::dummy_cols:

    fastDummies::dummy_cols(df, "x")
    

    An alternative with tidyverse functions:

    library(tidyverse)
    
    df %>% 
      left_join(., df %>% mutate(value = 1) %>% 
                  pivot_wider(names_from = x, values_from = value, values_fill = 0) %>% 
                  relocate(n, sort(colnames(.)[-1])),
                by = "n")  # join on the row id explicitly
    

    output

    > dummy <- fastDummies::dummy_cols(df, "x")
    > colnames(dummy)[-c(1,2)] <- LETTERS
    > head(dummy, 10)
    
        n x A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    1   1 Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    2   2 Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
    3   3 E 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    4   4 H 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    5   5 T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
    6   6 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
    7   7 R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
    8   8 F 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    9   9 Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    10 10 S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
    

    Benchmark

    Since there are many solutions and the question involves a large dataset, a benchmark might help. The nnet solution is the fastest according to the benchmark.

    set.seed(1)
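    df <- data.frame(n = seq_len(1000000), x = sample(LETTERS, 1000000, replace = TRUE))
    
    library(microbenchmark)
    library(ggplot2)  # provides the autoplot() generic used below
    # Each f*() wraps one proposed solution; the definitions are not included in this excerpt.
    bm <- microbenchmark(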
      fModel.matrix(),
      fContrasts(),
      fnnet(),
      fdata.table(),
      fFastDummies(),
      fDplyr(),
      times = 10L,
      setup = gc(FALSE)
    )
    autoplot(bm)
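
The benchmark calls wrapper functions whose bodies are not shown. As a rough sketch only, two of them might look like the following. The function names are borrowed from the benchmark call, but the bodies are assumptions: `fModel.matrix()` is guessed to use base R's `model.matrix()`, and `fnnet()` is guessed to use `nnet::class.ind()`. A small hypothetical data frame `small` stands in for the full dataset so the example runs quickly.

```r
# Hypothetical stand-in for the 1,000,000-row data frame in the question.
set.seed(1)
small <- data.frame(n = 1:10, x = sample(LETTERS[1:4], 10, replace = TRUE))

# Fix the level set up front so both approaches produce the same columns.
small$x <- factor(small$x, levels = LETTERS[1:4])

# Assumed body: a design matrix without an intercept has one 0/1 column per level.
fModel.matrix <- function(d = small) {
  m <- model.matrix(~ x - 1, data = d)
  colnames(m) <- levels(d$x)  # drop the "x" prefix that model.matrix() adds
  cbind(d, m)
}

# Assumed body: class.ind() builds the indicator matrix directly from a factor.
# nnet is a recommended package that normally ships with R.
fnnet <- function(d = small) {
  cbind(d, nnet::class.ind(d$x))
}

res_mm   <- fModel.matrix()
res_nnet <- fnnet()
```

Both results keep the original `n` and `x` columns and append one indicator column per level, so each row has exactly one 1 among the new columns. On the full dataset the same functions could be called as `fModel.matrix(df)` and `fnnet(df)` after converting `df$x` to a factor.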
    
