Tags: r, bioinformatics, linear-regression, glm, glmnet

Elastic net with Cox regression


I am trying to perform elastic-net Cox regression on 120 samples with ~100k features.

I tried R with the glmnet package, but R does not seem to handle matrices this big (it seems R was not designed for 64-bit indexing). Furthermore, glmnet does support sparse matrices, but for whatever reason sparse matrices have not been implemented for Cox regression.

I am not pushing for R, but it is the only tool I have found so far. Does anyone know what program I can use to compute elastic nets + Cox regression on big models? I did read that I could use a support vector machine, but I would need to build the model first, and I cannot do that in R due to the restriction above.

Edit: A bit of clarification. I am not reporting an error in R, as apparently it is normal for R to be limited in how many elements a matrix can hold (as for glmnet not supporting sparse matrices with Cox regression, I have no idea). I am not pushing for a particular tool, but it would be easier if there were another package or a stand-alone program that can do what I am looking for.

If someone has an idea or has done this before, please share your method (R, MATLAB, something else).

Edit 2:

Here is what I used to test: I made a 100 x 100,000 matrix, added class labels, and tried to create the design matrix using model.matrix.

data <- matrix(rnorm(100 * 100000), 100, 100000)  # 100 samples x 100,000 features
formula <- as.formula(class ~ .)
x <- c(rep('A', 40), rep('B', 30), rep('C', 30))
y <- sample(x = 1:100, size = 100)
class <- x[y]                                     # shuffled class labels
data <- cbind(data, class)                        # coerces the whole numeric matrix to character
X <- model.matrix(formula, data)                  # fails while building the model matrix

The error I got:

Error: cannot allocate vector of size 37.3 Gb
In addition: Warning messages:
1: In terms.formula(object, data = data) :
  Reached total allocation of 12211Mb: see help(memory.size)
2: In terms.formula(object, data = data) :
  Reached total allocation of 12211Mb: see help(memory.size)
3: In terms.formula(object, data = data) :
  Reached total allocation of 12211Mb: see help(memory.size)
4: In terms.formula(object, data = data) :
  Reached total allocation of 12211Mb: see help(memory.size)

Thank you in advance! :)

Edit 3: Thanks to @marbel, I was able to construct a test model that works and does not become too big (a sketch of the working setup is below). It seems my problem came from using cbind in my test.
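For completeness, the working setup looked roughly like this: the predictors stay in a numeric matrix and the labels live in a separate vector, so nothing gets coerced to character and no model.matrix call is needed (the code below is only a sketch, not the exact model).

x <- matrix(rnorm(100 * 100000), 100, 100000)                  # 100 samples x 100,000 numeric features
class <- sample(c(rep('A', 40), rep('B', 30), rep('C', 30)))   # labels kept in their own vector
# x stays numeric and can be handed to a modelling function as-is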


Solution

  • A few pointers:

    That's a rather small dataset; R should be more than enough. All you need is a modern computer, meaning a decent amount of RAM. I guess 4 GB should be enough for such a small dataset.

    The glmnet package is also available in Julia and Python, but I'm not sure whether the Cox model is available there.

    Here and here you have examples of the Cox model with the glmnet package. There is also a package called survival.
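    As a rough sketch (not taken from the linked examples), a Cox elastic-net fit with glmnet takes a numeric predictor matrix and a survival outcome given as a two-column matrix with columns named time and status; the data below is simulated purely for illustration:

    library(glmnet)

    x      <- matrix(rnorm(120 * 1000), 120, 1000)   # samples x features (simulated)
    time   <- rexp(120)                              # follow-up times (simulated)
    status <- rbinom(120, 1, 0.7)                    # 1 = event, 0 = censored (simulated)
    y      <- cbind(time = time, status = status)    # outcome format expected by family = "cox"

    # alpha = 1 is the lasso, alpha = 0 is ridge; anything in between is the elastic net.
    # cv.glmnet picks lambda by cross-validation.
    fit <- cv.glmnet(x, y, family = "cox", alpha = 0.5)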

    There are a few problems with your code (a short sketch addressing them follows the list):

    • This is not something you want to do in R: data <- cbind(data, class). It is simply not memory efficient. If you need this type of operation, use the data.table package; it allows assignment by reference; check out the := operator.
    • If all your data is numeric, you don't need model.matrix; just use data.matrix(X).
    • If you have categorical variables, use model.matrix on them only, then add the resulting columns to the X matrix, perhaps with data.table, one column at a time using ?data.table::set or the := operator.
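    Here is a self-contained sketch of those last points; the data and the names dt and grp are made up for illustration:

    library(data.table)

    # Simulated stand-in for the real data: numeric features plus one categorical column
    dt <- as.data.table(matrix(rnorm(100 * 1000), 100, 1000))
    dt[, grp := sample(c('A', 'B', 'C'), 100, replace = TRUE)]   # added by reference, no full copy

    X_num <- data.matrix(dt[, !"grp"])           # numeric columns straight to a matrix, no model.matrix
    grp_dummies <- model.matrix(~ grp - 1, dt)   # expand only the categorical column
    X <- cbind(X_num, grp_dummies)               # binding a handful of numeric dummy columns is cheap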

    Hopefully this can help you debug the code. Good luck!