
R model.matrix using same factor set among all columns


I have a set of basketball lineup data with five columns, all drawing from the same set of factor levels (player names), like so:

head(dat)
              V1             V2            V3            V4              V5
1   MILES,KEATON KINGSLEY,MOSES  BELL,ANTHLON HANNAHS,DUSTY   DURHAM,JABRIL
2   MILES,KEATON KINGSLEY,MOSES  BELL,ANTHLON HANNAHS,DUSTY   DURHAM,JABRIL
3 KINGSLEY,MOSES   BELL,ANTHLON HANNAHS,DUSTY DURHAM,JABRIL   THOMPSON,TREY
4 KINGSLEY,MOSES   BELL,ANTHLON HANNAHS,DUSTY THOMPSON,TREY     BEARD,ANTON
5  THOMPSON,TREY    BEARD,ANTON KOUASSI,WILLY   WHITT,JIMMY WATKINS,MANUALE
6  THOMPSON,TREY    BEARD,ANTON KOUASSI,WILLY   WHITT,JIMMY WATKINS,MANUALE

I want each row to be a dummy encoding of the players that appear in that row, with one column per player, like this:

MILES,KEATON  KINGSLEY,MOSES  BELL,ANTHLON  HANNAHS,DUSTY  DURHAM,JABRIL THOMPSON,TREY  BEARD,ANTON  KOUASSI,WILLY  WHITT,JIMMY  WATKINS,MANUALE
           1               1             1              1              1             0            0               0             0               0
           1               1             1              1              1             0            0               0             0               0
           0               1             1              1              1             1            0               0             0               0

However, model.matrix only seems to have a scope of one column; it won't let me share an entire factor set across multiple columns. Following some advice in [this thread][1], I tried:

library(Matrix)  # provides sparse.model.matrix
df <- as.data.frame(lapply(dat, as.factor))
fList <- lapply(names(df), reformulate, intercept = FALSE)
mList <- lapply(fList, sparse.model.matrix, data = df)
br <- do.call(cbind, mList)  # cBind() in older versions of Matrix
head(br)
6 x 31 sparse Matrix of class "dgCMatrix"
   [[ suppressing 31 column names ‘V1BEARD,ANTON’, ‘V1BELL,ANTHLON’, ‘V1KINGSLEY,MOSES’ ... ]]

1 . . . 1 . . . . 1 . . 1 . . . . . . 1 . . . . . . 1 . . . . .
2 . . . 1 . . . . 1 . . 1 . . . . . . 1 . . . . . . 1 . . . . .
3 . . 1 . . . 1 . . . . . . 1 . . . 1 . . . . . . . . . . . 1 .
4 . . 1 . . . 1 . . . . . . 1 . . . . . . . 1 . . 1 . . . . . .
5 . . . . 1 1 . . . . . . . . 1 . . . . . . . . 1 . . . . . . 1
6 . . . . 1 1 . . . . . . . . 1 . . . . . . . . 1 . . . . . . 1

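The result combines the column name with the factor level, so each player gets a separate dummy column for every column they appear in instead of a single shared one. How can I get one column per player?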


Solution

  • One option is mtabulate from qdapTools, applied to the transposed data (df1 here is the lineup data frame, called dat in the question):

    library(qdapTools)
    mtabulate(as.data.frame(t(df1)))
    # BELL,ANTHLON DURHAM,JABRIL HANNAHS,DUSTY KINGSLEY,MOSES MILES,KEATON THOMPSON,TREY BEARD,ANTON KOUASSI,WILLY
    #1            1             1             1              1            1             0           0             0
    #2            1             1             1              1            1             0           0             0
    #3            1             1             1              1            0             1           0             0
    #4            1             0             1              1            0             1           1             0
    #5            0             0             0              0            0             1           1             1
    #6            0             0             0              0            0             1           1             1
    #  WATKINS,MANUALE WHITT,JIMMY
    #1               0           0
    #2               0           0
    #3               0           0
    #4               0           0
    #5               1           1
    #6               1           1
    

    Or using base R:

    table(rep(seq_len(nrow(df1)), ncol(df1)), unlist(df1))
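
To make the base R approach easier to try, here is a minimal self-contained sketch. It rebuilds only the first three rows of the question's data by hand, and the names row_id, tab and dummies are illustrative rather than part of the original answer. It also assumes you want a plain 0/1 matrix rather than a table object.

```r
# First three lineups from the question, rebuilt by hand for illustration.
df1 <- data.frame(
  V1 = c("MILES,KEATON", "MILES,KEATON", "KINGSLEY,MOSES"),
  V2 = c("KINGSLEY,MOSES", "KINGSLEY,MOSES", "BELL,ANTHLON"),
  V3 = c("BELL,ANTHLON", "BELL,ANTHLON", "HANNAHS,DUSTY"),
  V4 = c("HANNAHS,DUSTY", "HANNAHS,DUSTY", "DURHAM,JABRIL"),
  V5 = c("DURHAM,JABRIL", "DURHAM,JABRIL", "THOMPSON,TREY"),
  stringsAsFactors = FALSE
)

# unlist() flattens the data frame column by column, so the row index
# 1..nrow has to be repeated once per column to line up with the values.
row_id <- rep(seq_len(nrow(df1)), times = ncol(df1))
tab <- table(row_id, unlist(df1, use.names = FALSE))

# Copy the counts into an ordinary matrix, keeping player names as column names.
dummies <- matrix(tab, nrow = nrow(tab),
                  dimnames = list(rownames(tab), colnames(tab)))

# Guard (an assumption about the desired output): if a player were listed
# twice in one row, the count would be 2; clamp everything to 0/1.
dummies <- (dummies > 0) * 1L

print(dummies)
```

Because all five columns are tabulated together, each player gets exactly one column no matter which of V1 to V5 they appear in. This avoids the column-name prefixes that come from building a separate model matrix per column.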