r, formula, cox-regression

R, how to put all the column names of a dataframe into a formula?


I'm trying to run a multivariate Cox regression analysis in R on my dataset, following this tutorial. In particular, I am trying to use the coxph() function:

install.packages(c("survival", "survminer"))
library("survival")
library("survminer")
data("lung")

res.cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)
summary(res.cox)

As you can see, in this case the names of the features (age + sex + ph.ecog) have been inserted manually in the formula.

In my case, however, I have thousands of features, so I cannot type their names manually. I need a way to insert them automatically. I tried this on the example above, without success. Here's what I tried:

featureNames <- paste(colnames(lung), collapse = " + ")
res.cox <- coxph(Surv(time, status) ~ featureNames, data =  lung)

And I got this error message:

Error in model.frame.default(formula = Surv(time, status) ~ featureNames,  : 
  variable lengths differ (found for 'featureNames')

Can someone help me? Thanks! I'm using R version 3.6.3 on a PC running Ubuntu Linux 18.04.5 LTS.


Solution

  • Use reformulate. First, set up a default formula:

    fS <- Surv(time, status) ~ . 
    

    Let's say you know the features beforehand:

    colnames(lung)
     [1] "inst"      "time"      "status"    "age"       "sex"       "ph.ecog"  
     [7] "ph.karno"  "pat.karno" "meal.cal"  "wt.loss"  
    
    features = c("ph.karno","age","meal.cal","wt.loss")
    
    # fS[[2]] is the left-hand side of the default formula, i.e. Surv(time, status)
    fs = reformulate(features, fS[[2]])
    
    coxph(fs, data =  lung)
    
    Call:
    coxph(formula = fs, data = lung)
    
                   coef  exp(coef)   se(coef)      z     p
    ph.karno -9.152e-03  9.909e-01  7.327e-03 -1.249 0.212
    age       1.629e-02  1.016e+00  1.168e-02  1.395 0.163
    meal.cal  5.087e-06  1.000e+00  2.391e-04  0.021 0.983
    wt.loss  -1.057e-03  9.989e-01  6.884e-03 -0.154 0.878
    
    Likelihood ratio test=5.84  on 4 df, p=0.2113
    n= 171, number of events= 124 
       (57 observations deleted due to missingness)
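
The original attempt failed because featureNames is a single character string, so the formula treats it as one variable of length 1 instead of expanding it into terms. To cover the "thousands of features" case, the same reformulate idea can be applied to every column except the ones used in the response. The following is a minimal sketch, assuming the column names printed by colnames(lung) above; the variable names cols and f are only illustrative, and with your own data you would use colnames(yourData) instead of the hard-coded vector.

```r
# Column names as printed by colnames(lung) in the answer above
cols <- c("inst", "time", "status", "age", "sex", "ph.ecog",
          "ph.karno", "pat.karno", "meal.cal", "wt.loss")

# Drop the variables that make up the response
features <- setdiff(cols, c("time", "status"))

# Build the formula; the response is passed as an unevaluated call
f <- reformulate(features, response = quote(Surv(time, status)))
print(f)

# Fitting needs the survival package and the lung data:
# library(survival)
# res.cox <- coxph(f, data = lung)
#
# A shorter alternative that should give the same set of predictors
# is the dot shorthand, which expands to all columns of `data`
# that do not appear in the response:
# res.cox <- coxph(Surv(time, status) ~ ., data = lung)
```

If some column names are not syntactically valid R names (for example, they contain dashes or spaces), they would probably need to be wrapped in backticks before being passed to reformulate.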