
R - Multilevel Variable into Dummies


I feel like this should be easier.

Assume I have a field that contains a "multivalued" item, e.g. genres of a movie.

I want to break those out into dummy (indicator) columns, so that a row with more than one item gets a 1 in each of the corresponding columns.

How do I do that in a nice, convenient way?

Bad R Example:

library(tidyverse)

data <- tribble(
  ~column,
  "var1",
  "var1 / var2",
  "var2",
  "var3",
  "var1 / var3",
  "var2 / var3"
)

data %>%
  separate(column, into = c("item1", "item2"), sep = " / ", fill = "right") %>%
  mutate(across(everything(), ~ factor(.x, levels = c("var1", "var2", "var3")))) %>%
  mutate(row = as.factor(row_number())) ->
  intermediate

head(intermediate)
#> # A tibble: 6 × 3
#>    item1  item2    row
#>   <fctr> <fctr> <fctr>
#> 1   var1     NA      1
#> 2   var1   var2      2
#> 3   var2     NA      3
#> 4   var3     NA      4
#> 5   var1   var3      5
#> 6   var2   var3      6

v1 <- xtabs( ~ row + item1, data = intermediate)
v2 <- xtabs( ~ row + item2, data = intermediate)

combined <- v1 + v2

combined
#>    item1
#> row var1 var2 var3
#>   1    1    0    0
#>   2    1    1    0
#>   3    0    1    0
#>   4    0    0    1
#>   5    1    0    1
#>   6    0    1    1

That feels really un-R-like.

Python Example

This is pretty easy to do in Python with sklearn's DictVectorizer. For instance:

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

d = [
    "var1",
    "var1 / var2",
    "var2",
    "var3",
    "var1 / var3",
    "var2 / var3"
]

data = pd.DataFrame(d, columns = ["column"])

col = data.column.str.split(" / ")
col = col.apply(lambda row: {key: 1 for key in row})

transformer = DictVectorizer()
transformer.fit_transform(col).todense()

#> matrix([[ 1.,  0.,  0.],
#>         [ 1.,  1.,  0.],
#>         [ 0.,  1.,  0.],
#>         [ 0.,  0.,  1.],
#>         [ 1.,  0.,  1.],
#>         [ 0.,  1.,  1.]])

I'm really just looking for a "tidy" equivalent in R-land.


Solution

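  • You can use the splitstackshape package:

    x <- c("var1",
           "var1 / var2",
           "var2",
           "var3",
           "var1 / var3",
           "var2 / var3"
    )

    library(splitstackshape)

    # charMat() is an internal (unexported) helper, hence the `:::`.
    # It builds an indicator matrix from the list of split values,
    # filling the cells for absent values with 0.
    splitstackshape:::charMat(strsplit(x, " / "), 0)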
    
         var1 var2 var3
    [1,]    1    0    0
    [2,]    1    1    0
    [3,]    0    1    0
    [4,]    0    0    1
    [5,]    1    0    1
    [6,]    0    1    1
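
Since the question asks for a "tidy" equivalent, here is a minimal sketch of one way to get the same indicator columns with dplyr and tidyr alone. It is not part of the splitstackshape answer above. The helper column names `row` and `present` are made up for illustration, and it assumes a tidyr version recent enough to have `pivot_wider()` with a scalar `values_fill` (1.1.0 or later, as far as I know).

```r
library(dplyr)
library(tidyr)

data <- tibble(column = c(
  "var1", "var1 / var2", "var2", "var3", "var1 / var3", "var2 / var3"
))

dummies <- data %>%
  mutate(row = row_number()) %>%          # remember which row each value came from
  separate_rows(column, sep = " / ") %>%  # one line per (row, value) pair
  distinct() %>%                          # guard against a value repeated within a row
  mutate(present = 1L) %>%                # hypothetical helper column holding the indicator
  pivot_wider(
    names_from  = column,                 # each distinct value becomes a column
    values_from = present,
    values_fill = 0L                      # absent values become 0 instead of NA
  )

dummies
```

The result has one row per original row, with `row` as the first column followed by `var1`, `var2` and `var3`. Drop `row` with `select(-row)` if only the dummies are needed.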