I am working with R version 3.2.3 on a linux machine.
I have a dataset with 1,374,439 observations of 145 variables. I need to transform this dataframe into a binary matrix.
I have looked at different forums and I found a solution with the package reshape2 and the functions melt()
and dcast()
. This works perfectly with a small dataset (I always first tried my codes on a small parts in order to check if it is doing what I want). When I want to use this code on the whole dataset, it does not work anymore.
I have looked at others forums and I have tried (with no success), the following functions:
table()
sparseMatrix()
and as.Matrix()
xtabs()
I have also found the use of the packages dplyr and tidyr for bigger datasets. But I did not succeed. Honestly, I struggled with the understanding too. But it seems that the size of my dataset is the main problem...
The data looks like that (this is a version short version):
Code_1 Code_2 Code_3 Code_4 Code_5 Code_6 Code_7
1 M201 M2187 M670
2 O682 O097 Z370 O48 O759
3 S7211 Z966 Z501
And I would like to have this (a binary matrix):
M201 M2187 M670 O682 O097 Z370 O48 0759 S7211 Z966 Z501
1 1 1 1 0 0 0 0 0 0 0 0
2 0 0 0 1 1 1 1 1 0 0 0
3 0 0 0 0 0 0 0 0 1 1 1
I would also like to be precise, that the blank spaces are not NAs. It is really blank spaces.
What you actually need - is Matrix::sparse.model.matrix()
function. Answers below create dense matrices, which will quickly eat all your ram on this dataset.
Here is simple example:
M = sparse.model.matrix( ~ ., data=data.frame(x = letters , y = LETTERS))
If you do not need intercept, use following formula
M = sparse.model.matrix( ~ -1 + ., data=data.frame(x = letters , y = LETTERS))