Search code examples
rsurvival-analysiscox-regression

Using categories as new variables?


I am teaching myself R since some days and I am stuck with a cox regression analysis.

I managed to divide each of the the continuous variables I have into 2 categorical groups using the cut() function. Now I wonder if I can combine these categories seperately in a cox regression analysis.

For example: I have 3 variables (A,B,C), each consisting of 2 categories: A-,A+ & B-,B+ & C-,C+. Now if I run the coxph() with naming the variables I only get the results for the "+" categories (which as I understand is because the "-" categories are used as a reference group). However, since C- seems to have a negative effect on survival, I would be more interested in recieving the result for that category.

Also I wonder if there is a way I can define each category as a new group/variable and combine them individually to see the impact of each on the survival? Or is this unnecessary?

Edit: I'll try giving a more specific example, I hope it works :)

#example data:
test<-structure(list(A = c(8, 6, 42, 97, 55, 1, 5, 7, 55, 4), B = c(93, 9, 65, 2, 51, 89, 1, 1, 5, 62), C = c(58, 99, 100, 98, 92, 100, 99, 95, 81, 67), time = c(1.6, 34.6, 1.5, 35.8, 7.7, 38.6, 40.2, 4.7, 37.6, 8.6), event= c(1, 0, 0, 0, 1, 0, 0, 1, 0, 1)))
mydata<-as.data.frame(test)

It should look like so:

 mydata
    A  B   C time status
1   8 93  58  1.6      1
2   6  9  99 34.6      0
3  42 65 100  1.5      0
4  97  2  98 35.8      0
5  55 51  92  7.7      1
6   1 89 100 38.6      0
7   5  1  99 40.2      0
8   7  1  95  4.7      1
9  55  5  81 37.6      0
10  4 62  67  8.6      1

And concerning the questions above, this is what I have done so far:

#load survival package
library("survival")

#define variables
A <- c(mydata[1:10,1])
B <- c(mydata[1:10,2])
C <- c(mydata[1:10,3])
OS <- c(mydata[1:10,4])
Event <- c(mydata[1:10,5])

# dependent and independent variables
y <- Surv(OS, Event)
x <- cbind(A, B, C)
mydata <- data.frame(cbind(x,y))

#Cox proportional hazard model, with the "raw data"
coxph1 <- coxph(y~x,data=mydata, method="breslow")
summary(coxph1)

#categorising the variables

CA=cut(mydata$A, br=c(-1,20,101), labels = c("[A-]", "[A+]"))
CB=cut(mydata$B, br=c(-1,20,101), labels = c("[B-]", "[B+]"))
CC=cut(mydata$C, br=c(-1,96,101), labels = c("[C-]", "[C+]"))

#Cox regression, combined with cut intervals
coxph2=coxph(y~CA+CB+CC, data=mydata, method="breslow")
summary(coxph2)

And the expected output is:

coxph(formula = y ~ x, data = mydata, method = "breslow")

  n= 10, number of events= 4 

         coef  exp(coef)   se(coef)      z Pr(>|z|)
xA  0.0001443  1.0001443  0.0238329  0.006    0.995
xB  0.0104826  1.0105378  0.0211830  0.495    0.621
xC -0.0497490  0.9514682  0.0383305 -1.298    0.194

   exp(coef) exp(-coef) lower .95 upper .95
xA    1.0001     0.9999    0.9545     1.048
xB    1.0105     0.9896    0.9694     1.053
xC    0.9515     1.0510    0.8826     1.026

Concordance= 0.769  (se = 0.167 )
Rsquare= 0.29   (max possible= 0.799 )
Likelihood ratio test= 3.43  on 3 df,   p=0.33
Wald test            = 3.3  on 3 df,   p=0.3476
Score (logrank) test = 4.24  on 3 df,   p=0.2364


coxph(formula = y ~ CA + CB + CC, data = mydata, method = "breslow")

  n= 10, number of events= 4 

             coef  exp(coef)   se(coef)      z Pr(>|z|)
CA[A+] -1.036e+00  3.549e-01  1.262e+00 -0.821    0.412
CB[B+]  4.294e-01  1.536e+00  1.274e+00  0.337    0.736
CC[C+] -2.162e+01  4.095e-10  2.094e+04 -0.001    0.999

       exp(coef) exp(-coef) lower .95 upper .95
CA[A+] 3.549e-01  2.818e+00    0.0299     4.213
CB[B+] 1.536e+00  6.509e-01    0.1266    18.653
CC[C+] 4.095e-10  2.442e+09    0.0000       Inf

Concordance= 0.904  (se = 0.165 )
Rsquare= 0.542   (max possible= 0.799 )
Likelihood ratio test= 7.8  on 3 df,   p=0.05031
Wald test            = 1.15  on 3 df,   p=0.7653
Score (logrank) test = 6.42  on 3 df,   p=0.09288

Solution

  • First off, on several occasions there is unnecessarily convoluted code. The test data can be written:

    mydata <- data.frame(A = c(8, 6, 42, 97, 55, 1, 5, 7, 55, 4), 
                         B = c(93, 9, 65, 2, 51, 89, 1, 1, 5, 62), 
                         C = c(58, 99, 100, 98, 92, 100, 99, 95, 81, 67), 
                         time = c(1.6, 34.6, 1.5, 35.8, 7.7, 38.6, 40.2, 4.7, 37.6, 8.6), 
                         event= c(1, 0, 0, 0, 1, 0, 0, 1, 0, 1))
    

    Instead of OS <- c(mydata[1:10,4]) you can write OS <- mydata$time but there is no need to take them our of the mydata. As you can see, the model is the same as the first one in your question:

    > (coxph1 <- coxph(Surv(time, event) ~ ., data=mydata, method="breslow"))
    Call:
    coxph(formula = Surv(time, event) ~ ., data = mydata, method = "breslow")
    
           coef exp(coef)  se(coef)     z    p
    A  0.000144  1.000144  0.023833  0.01 1.00
    B  0.010483  1.010538  0.021183  0.49 0.62
    C -0.049749  0.951468  0.038331 -1.30 0.19
    
    Likelihood ratio test=3.43  on 3 df, p=0.33
    n= 10, number of events= 4 
    

    Regarding your question of combining the covariates individually - the ~. means using all other covariates. you could specify them specifically with ~ A + B + C or any other combination.

    Regarding changing the reference category - this is only needed with >2 categories. The meaning of the coefficient is the difference between the category and the reference category. If only 2 categories are present, than changing the reference will give the same coefficient, bit with a '-' sign. To change a reference category in a Factor use the relevel function:

     mydata$CA <- cut(mydata$A, br=c(-1,20,101), labels = c("[A-]", "[A+]"))
     mydata$CB <- cut(mydata$B, br=c(-1,20,101), labels = c("[B-]", "[B+]"))
     mydata$CC <- cut(mydata$C, br=c(-1,96,101), labels = c("[C-]", "[C+]"))
    
    mydata$CA <- relevel(mydata$CA, 2)
    > (coxph1 <- coxph(Surv(time, event) ~ CA, data=mydata, method="breslow"))
    Call:
    coxph(formula = Surv(time, event) ~ CA, data = mydata, method = "breslow")
    
            coef exp(coef) se(coef)    z    p
    CA[A-] 0.559     1.749    1.158 0.48 0.63
    

    Hope this helps :)