Search code examples
rmatrixbinarylarge-data

Creating Binary Matrix in R with a large dataset


I am working with R version 3.2.3 on a linux machine.

I have a dataset with 1,374,439 observations of 145 variables. I need to transform this dataframe into a binary matrix.

I have looked at different forums and I found a solution with the package reshape2 and the functions melt() and dcast(). This works perfectly with a small dataset (I always first tried my codes on a small parts in order to check if it is doing what I want). When I want to use this code on the whole dataset, it does not work anymore.

I have looked at others forums and I have tried (with no success), the following functions:

  • table()
  • sparseMatrix() and as.Matrix()
  • xtabs()

I have also found the use of the packages dplyr and tidyr for bigger datasets. But I did not succeed. Honestly, I struggled with the understanding too. But it seems that the size of my dataset is the main problem...

The data looks like that (this is a version short version):

    Code_1  Code_2  Code_3  Code_4  Code_5  Code_6  Code_7
1   M201            M2187   M670
2   O682            O097    Z370             O48    O759
3   S7211   Z966            Z501

And I would like to have this (a binary matrix):

    M201    M2187   M670   O682   O097  Z370    O48   0759    S7211  Z966    Z501
1   1          1       1      0      0     0      0      0        0     0       0
2   0          0       0      1      1     1      1      1        0     0       0
3   0          0       0      0      0     0      0      0        1     1       1

I would also like to be precise, that the blank spaces are not NAs. It is really blank spaces.


Solution

  • What you actually need - is Matrix::sparse.model.matrix() function. Answers below create dense matrices, which will quickly eat all your ram on this dataset.

    Here is simple example:

    M = sparse.model.matrix( ~ ., data=data.frame(x = letters , y = LETTERS))
    

    If you do not need intercept, use following formula

    M = sparse.model.matrix( ~ -1 + ., data=data.frame(x = letters , y = LETTERS))