Search code examples
rsurvival-analysis

R: Using Log Rank Test (survdiff)


OK, so I have a dataframe that looks like this:

head(exprs, 21)

   sample expr              ID X_OS
1     BIX high TCGA_DM_A28E_01   26
2     BIX high TCGA_AY_6197_01   88
3     BIX high TCGA_HB_KH8H_01  553
4     BIX  low TCGA_K4_6303_01  256
5     BIX  low TCGA_F4_6703_01  491
6     BIX  low TCGA_Y7_PIK2_01  177
7     BIX  low TCGA_A6_5657_01  732
8     HEF high TCGA_DM_A28E_01   26
9     HEF high TCGA_AY_6197_01   88
10    HEF high TCGA_F4_6703_01  491
11    HEF high TCGA_HB_KH8H_01  553
12    HEF  low TCGA_K4_6303_01  256
13    HEF  low TCGA_Y7_PIK2_01  177
14    HEF  low TCGA_A6_5657_01  732
15    TUR high TCGA_DM_A28E_01   26
16    TUR high TCGA_F4_6703_01  491
17    TUR high TCGA_Y7_PIK2_01  177
18    TUR  low TCGA_K4_6303_01  256
19    TUR  low TCGA_AY_6197_01   88
20    TUR  low TCGA_HB_KH8H_01  553
21    TUR  low TCGA_A6_5657_01  732

Simply, for each sample, there are 7 patients, each with a survival time (X_OS) and expression level high or low (expr). In the code below, I wish to take the first sample and run it through the survdiff function, with the outputs going to dfx. However, I'm new to survival analysis and I'm not sure how to use the parameters of the survdiff function. I wish to compare high and low expression groups for each sample. How can I edit the function expfun to yield the survdiff output I need? In addition, ideally I'd love to get the pvalues out of it, but I can work on that in a later step. Thank you!

expfun = function(x) {
  survdiff(Surv(x$X_OS, x$expr))
}

dfx <- pblapply(split(exprs[c("expr", "X_OS")], exprs$sample), expfun)

Solution

  • Try this. I added a proper Surv() call because you only had times and no status argument and I made it into a formula (with the predictor on the other side of the tilde) because Surv function expects status as its second argument and survdiff expects a formula as its first argument. That means you need to use the regular R regression calling convention where column names are used as the formula tokens and the dataframe is given to the data argument. If you had a censoring variable, it would be put in as the second Surv argument rather than the 1's that I have in there now.

     expfun = function(x) {
      survdiff( Surv( X_OS, rep(1,nrow(x)) ) ~ expr, data=x)
    }
    
    dfx <- lapply(split(exprs[c("expr", "X_OS")], exprs$sample), expfun)
    

    This is the result from print.survdiff:

    > dfx
    $BIX
    Call:
    survdiff(formula = Surv(X_OS, rep(1, nrow(x))) ~ expr, data = x)
    
              N Observed Expected (O-E)^2/E (O-E)^2/V
    expr=high 3        3     2.05     0.446     0.708
    expr=low  4        4     4.95     0.184     0.708
    
     Chisq= 0.7  on 1 degrees of freedom, p= 0.4 
    
    $HEF
    Call:
    survdiff(formula = Surv(X_OS, rep(1, nrow(x))) ~ expr, data = x)
    
              N Observed Expected (O-E)^2/E (O-E)^2/V
    expr=high 4        4     3.14     0.237      0.51
    expr=low  3        3     3.86     0.192      0.51
    
     Chisq= 0.5  on 1 degrees of freedom, p= 0.475 
    
    $TUR
    Call:
    survdiff(formula = Surv(X_OS, rep(1, nrow(x))) ~ expr, data = x)
    
              N Observed Expected (O-E)^2/E (O-E)^2/V
    expr=high 3        3     1.75     0.902      1.41
    expr=low  4        4     5.25     0.300      1.41
    
     Chisq= 1.4  on 1 degrees of freedom, p= 0.235 
    

    Note that you can see the code to produce the print output with:

    getAnywhere(print.survdiff)