r dataframe tidyverse spread

Converting features to dummies


I have this matrix:

quimio = matrix(c(51,33,16,58,29,13,48,42,30,26,38,16), 
            nrow = 4, ncol = 3)

colnames(quimio) = c("Pouca", "Média", "Alta")
rownames(quimio) = c("Tipo I", "Tipo II", "Tipo III", "Tipo IV")

Which looks like this:

          Pouca Média Alta
Tipo I      51    29   30
Tipo II     33    13   26
Tipo III    16    48   38
Tipo IV     58    42   16

I want to turn it into a tibble such that these row and column names are all dummy variables.

To make a bar chart, I got as far as this:

library(tidyverse)

tipo = c("Tipo I", "Tipo II", "Tipo III", "Tipo IV")

tipos = rep(tipo, 3)

quimiotb = as_tibble(quimio)  # as.tibble() is deprecated in favour of as_tibble()
quimiotb = gather(quimiotb)
quimiotb$tipo = tipos

quimiotb = rename(quimiotb, reacao = key)
quimiotb$reacao = factor(quimiotb$reacao)
quimiotb$tipo = factor(quimiotb$tipo)

This is what I get:

# A tibble: 12 x 3
   reacao value tipo
   <fct>  <dbl> <fct>
 1 Pouca     51 Tipo I
 2 Pouca     33 Tipo II
 3 Pouca     16 Tipo III
 4 Pouca     58 Tipo IV
 5 Média     29 Tipo I
 6 Média     13 Tipo II
 7 Média     48 Tipo III
 8 Média     42 Tipo IV
 9 Alta      30 Tipo I
10 Alta      26 Tipo II
11 Alta      38 Tipo III
12 Alta      16 Tipo IV
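
For reference, the same long-form table can probably be built without the hand-made tipos vector. This is a minimal sketch, assuming the tibble and tidyr packages are installed (tidyr 1.0 or later, for pivot_longer); the column names tipo, reacao and value simply mirror the ones used above.

```r
library(tibble)
library(tidyr)

# Rebuild the matrix from the question
quimio <- matrix(c(51, 33, 16, 58, 29, 13, 48, 42, 30, 26, 38, 16),
                 nrow = 4, ncol = 3)
colnames(quimio) <- c("Pouca", "Média", "Alta")
rownames(quimio) <- c("Tipo I", "Tipo II", "Tipo III", "Tipo IV")

# Keep the row names as a column, then stack the three reaction columns
long <- as_tibble(quimio, rownames = "tipo")
long <- pivot_longer(long, cols = -tipo,
                     names_to = "reacao", values_to = "value")

# Store both labels as factors, as in the original code
long$tipo   <- factor(long$tipo)
long$reacao <- factor(long$reacao)

long
```

The rows come out grouped by tipo rather than by reacao, but the twelve (tipo, reacao, value) combinations are the same.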

This works fine for a bar chart with ggplot2, but I can't fit a model to it: that would require tipo to be spread into 4 columns and reacao into 3. As it stands, the first row of the tibble reads "51 patients with Tipo I cancer had Pouca reacao". I've thought about using spread(), but I can't find the right combination of arguments. Any help would be appreciated.

tl;dr

I need to tidy quimiotb and don't know how

EDIT: The expected output should be something like this:

  A tibble: Y x 7
    Pouca Média Alta  Tipo I Tipo II Tipo III Tipo IV
    <fct> <fct> <fct> <fct>  <fct>   <fct>    <fct>
  1 0     1     0     0      1       0        0
  2 1     0     0     1      0       0        0

Solution

  • R's modelling routines build the model matrix (the dummy variables) for you internally from factor columns, so you don't have to create it yourself. Converting the table to a long data frame should be sufficient:

    as.data.frame.table(quimio)
    

    model.matrix can create a model matrix from that data frame, but you don't need to call it yourself, as the code below shows.

    Now you can do things like:

    DF <- as.data.frame.table(quimio)
    fm0 <- lm(Freq ~ Var1, DF) # or maybe you want Var2?
    fm1 <- lm(Freq ~ Var1 + Var2, DF) 
    anova(fm0, fm1) # compare
    

    or look at the t-tests for the Var2 coefficients in the output of summary(fm1) to see whether they are significantly different from zero.

    Or maybe you want to run a chi-squared test on the original data:

    chisq.test(quimio)
    

    Anyway, there are many modelling functions in R; you now have the data in the form they need and can explore them.
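
If you do want the one-row-per-patient dummy layout from the question's EDIT, the counts can be expanded first and model.matrix called explicitly. This is only a sketch in base R, assuming each count in quimio is a number of patients; the names patients and dummies are made up for illustration. The two factors are passed to model.matrix separately because a single formula containing both would drop one level of the second factor as the reference level.

```r
# Rebuild the matrix from the question
quimio <- matrix(c(51, 33, 16, 58, 29, 13, 48, 42, 30, 26, 38, 16),
                 nrow = 4, ncol = 3,
                 dimnames = list(c("Tipo I", "Tipo II", "Tipo III", "Tipo IV"),
                                 c("Pouca", "Média", "Alta")))

# Long form: one row per (tipo, reacao) cell, with the count in Freq
DF <- as.data.frame.table(quimio)
names(DF) <- c("tipo", "reacao", "Freq")

# One row per patient: repeat each cell's row Freq times
patients <- DF[rep(seq_len(nrow(DF)), DF$Freq), c("tipo", "reacao")]
rownames(patients) <- NULL

# Full 0/1 indicator columns for every level (intercept removed with - 1)
dummies <- cbind(
  model.matrix(~ reacao - 1, patients),
  model.matrix(~ tipo - 1, patients)
)

dim(dummies)      # 400 rows (total patients), 7 indicator columns
colSums(dummies)  # per-level totals; they match the margins of quimio
```

The result is a numeric 0/1 matrix rather than a tibble of factors. For lm, glm and similar functions, passing the factor columns in patients (or the counts in DF) directly is usually simpler, since those functions build these indicators themselves.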