I have a dataset similar to this:
df <- data.frame(n = 1:1000000, x = sample(LETTERS, 1000000, replace = TRUE))
I'm looking for guidance on how to split variable x
into multiple indicator variables (one per code) that take the values 0 or 1.
In the end it would look like this:
n x A B C D E F G H . . .
1 D 0 0 0 1 0 0 0 0 . . .
2 B 0 1 0 0 0 0 0 0 . . .
3 F 0 0 0 0 0 1 0 0 . . .
In my dataset, there are many more codes in variable x, so adding each new variable manually would be too time-consuming.
I was thinking about sorting the codes in x, assigning each a unique number, and then writing a loop that creates a new variable for each code in x. But I feel like I'm overcomplicating things.
A fast and easy way is to use fastDummies::dummy_cols:
fastDummies::dummy_cols(df, "x")
An alternative with tidyverse functions:
library(tidyverse)
df %>%
left_join(., df %>% mutate(value = 1) %>%
pivot_wider(names_from = x, values_from = value, values_fill = 0) %>%
relocate(n, sort(colnames(.)[-1])))
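If neither package is available, the same result can be sketched in base R. This is only an illustration on a small stand-in data frame; the names `small`, `codes`, `indicators` and `result` are made up for the example and do not come from the question.

```r
# Small stand-in for the question's data (same column names, fewer rows).
set.seed(1)
small <- data.frame(n = 1:10, x = sample(LETTERS, 10, replace = TRUE))

# Distinct codes in x; sort() fixes the order of the new columns.
codes <- sort(unique(as.character(small$x)))

# One integer 0/1 column per code. Because `codes` is a character vector,
# sapply() uses its values as the column names of the resulting matrix.
indicators <- sapply(codes, function(code) as.integer(small$x == code))

# Append the indicator columns to the original data.
result <- cbind(small, indicators)
head(result)
```

This is essentially the loop described in the question, written with sapply(); it is meant to show the idea, not to be fast on a million rows.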
output
> dummy <- fastDummies::dummy_cols(df, "x")
> colnames(dummy)[-c(1,2)] <- LETTERS
> head(dummy, 10)
n x A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 1 Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2 2 Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
3 3 E 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 4 H 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 5 T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
6 6 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
7 7 R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
8 8 F 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 9 Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
10 10 S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
Benchmark
Since there are many solutions and the question involves a large dataset, a benchmark might help. The nnet solution is the fastest according to the benchmark.
set.seed(1)
df <- data.frame(n = 1:1000000, x = sample(LETTERS, 1000000, replace = TRUE))
library(microbenchmark)
library(ggplot2)  # provides the autoplot() generic used below
bm <- microbenchmark(
fModel.matrix(),
fContrasts(),
fnnet(),
fdata.table(),
fFastDummies(),
fDplyr(),
times = 10L,
setup = gc(FALSE)
)
autoplot(bm)
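The definitions of the benchmarked functions are not included here. As a rough sketch, two of them might look like the following. The bodies and the `data` argument are assumptions for illustration only (the benchmark calls the functions without arguments, so the originals presumably work on the global df), and the example assumes the nnet package is installed.

```r
# Hypothetical stand-ins for two of the benchmarked functions;
# these are not the original definitions.

# model.matrix(): "- 1" drops the intercept so that every level of x
# gets its own 0/1 column. Columns come back named "xA", "xB", ...,
# so the leading "x" is stripped.
fModel.matrix <- function(data) {
  m <- model.matrix(~ x - 1, data = data)
  colnames(m) <- sub("^x", "", colnames(m))
  cbind(data, m)
}

# nnet::class.ind(): builds a class indicator matrix with one column
# per level, named after the levels.
fnnet <- function(data) {
  cbind(data, nnet::class.ind(data$x))
}

# Small example with the same column names as the question's data.
set.seed(1)
small <- data.frame(n = 1:10, x = sample(LETTERS, 10, replace = TRUE))
res_mm   <- fModel.matrix(small)
res_nnet <- fnnet(small)
head(res_nnet)
```

Both sketches return the original columns followed by one indicator column per code that occurs in x, in sorted order.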