r dataframe data-manipulation one-hot-encoding dummy-variable

Split variable into multiple factor variables

I have some dataset similar to this:

df <- data.frame(n = 1:1000000, x = sample(LETTERS, 1000000, replace = TRUE))

I'm looking for guidance on how to split variable x into multiple indicator variables, one per code, each taking the value 0 or 1.

In the end it would look like this:

n x A B C D E F G H . . .
1 D 0 0 0 1 0 0 0 0 . . .
2 B 0 1 0 0 0 0 0 0 . . .
3 F 0 0 0 0 0 1 0 0 . . .

In my dataset, there are many more codes in variable x, so adding each new variable manually would be too time-consuming.

I was thinking about sorting the codes in x, assigning each one a unique number, and then writing a loop that creates a new variable for each code in x. But I feel like I'm overcomplicating things.
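For reference, the loop idea described above does not need the intermediate numbering step. A minimal base R sketch, assuming a small stand-in for the real data and that the codes are valid, non-clashing column names (neither assumption comes from the real dataset):

```r
# Small stand-in for the real data (assumption: same shape, fewer rows)
set.seed(1)
df <- data.frame(n = 1:10, x = sample(LETTERS, 10, replace = TRUE))

# Add one 0/1 column per distinct code in x, in sorted order.
# A code named "n" or "x" would overwrite an existing column.
for (code in sort(unique(df$x))) {
  df[[code]] <- as.integer(df$x == code)
}

df
```

Each row ends up with exactly one 1 among the new columns. On a million rows this is likely slower than a vectorised approach, but it shows that the loop itself is short.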

Solution

A fast and easy way is to use fastDummies::dummy_cols:

fastDummies::dummy_cols(df, "x")

An alternative with tidyverse functions:

library(tidyverse)

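# Build the indicator columns from n and x, then join them back onto df by n
df %>%
  left_join(., df %>% mutate(value = 1) %>%
              pivot_wider(names_from = x, values_from = value, values_fill = 0) %>%
              # keep n first, then the new columns in alphabetical order
              relocate(n, all_of(sort(colnames(.)[-1]))),
            by = "n")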

output

> dummy <- fastDummies::dummy_cols(df, "x")
> colnames(dummy)[-c(1,2)] <- LETTERS
> dummy

    n x A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1   1 Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2   2 Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
3   3 E 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4   4 H 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5   5 T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
6   6 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
7   7 R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
8   8 F 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9   9 Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
10 10 S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

Benchmark

Since there are many solutions and the question involves a large dataset, a benchmark might help. The nnet solution is the fastest according to the benchmark.

set.seed(1)
df <- data.frame(n = 1:1000000, x = sample(LETTERS, 1000000, replace = TRUE))

library(microbenchmark)
library(ggplot2)  # provides the autoplot() generic used below
bm <- microbenchmark(
  fModel.matrix(),
  fContrasts(),
  fnnet(),
  fdata.table(),
  fFastDummies(),
  fDplyr(),
  times = 10L,
  setup = gc(FALSE)
)
autoplot(bm)
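The benchmark calls wrapper functions such as fModel.matrix() and fnnet() whose definitions are not shown here. As a rough sketch only, two of them might look like the following; the function bodies are assumptions built on base R's model.matrix() and on nnet::class.ind(), not the original definitions, and the data is shrunk to 1000 rows to keep the example quick.

```r
# Hypothetical reconstructions of two benchmark wrappers (not the originals)
set.seed(1)
df <- data.frame(n = 1:1000, x = sample(LETTERS, 1000, replace = TRUE))

# model.matrix() with "- 1" drops the intercept, so every level of x
# gets its own 0/1 column; the columns come back named xA, xB, ...
fModel.matrix <- function() {
  m <- model.matrix(~ x - 1, data = df)
  colnames(m) <- sub("^x", "", colnames(m))  # strip the "x" prefix
  cbind(df, m)
}

# nnet::class.ind() builds the 0/1 class indicator matrix directly,
# with one column per level, named after the level
fnnet <- function() {
  cbind(df, nnet::class.ind(df$x))
}

head(fnnet(), 3)
```

Both sketches return df with one extra column per code. class.ind() allocates the indicator matrix and fills it in a single indexing step, which is one plausible reason the nnet variant comes out ahead in the benchmark.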