Search code examples
rcox-regression

Coxph "X matrix deemed to be singular" without perfect classification?


I am trying to carry out a self-controlled case series using the package sccs from here. This is a statistical method which takes "baseline" and "exposed" periods in the timecourse of e.g. a year of a patient's life. The exposed periods could represent exposure to a drug and the outcome measured could be a side-effect of the drug as, indeed, it is in my case.

This package essentially formats the data into intervals of baseline and exposure risk. A patient identifier indivL (factor), interval (integer, number of days), exposure status (0/1), event status (0/1). It then feeds this data into survival::clogit as a model of the form:

event ~ exposure + strata(indivL) + offset(log(interval))

The data fed to clogit is a data frame of the form:

   indivL event eventday  lower  upper interval age   exposure  indiv aevent astart   aend drugtype
 * <fct>  <dbl>    <int>  <dbl>  <dbl>    <dbl> <fct> <fct>    <dbl>  <dbl>  <dbl>  <dbl>     <dbl>
 1 1         0.    22361 22219. 22252.      34. 1     0           1. 22361. 22219. 22460.        0.
 2 1         0.    22361 22253. 22260.       8. 1     1           1. 22361. 22219. 22460.        0.
 3 1         1.    22361 22261. 22460.     200. 1     0           1. 22361. 22219. 22460.        0.
 4 2         0.    22401 22219. 22252.      34. 1     0           1. 22401. 22219. 22460.        0.
 5 2         0.    22401 22253. 22260.       8. 1     1           1. 22401. 22219. 22460.        0.
 6 2         1.    22401 22261. 22460.     200. 1     0           1. 22401. 22219. 22460.        0.
 7 3         0.    31071 30834. 30863.      30. 1     0           2. 31071. 30834. 31075.        0.
 8 3         0.    31071 30864. 30871.       8. 1     1           2. 31071. 30834. 31075.        0.
 9 3         1.    31071 30872. 31075.     204. 1     0           2. 31071. 30834. 31075.        0.
10 4         1.      261   207.   356.     150. 1     0           3.   261.   207.   425.        0.
# ... with 1,211,460 more rows

I have my model working well for giving me a result when using the above. However, I want to add in other independent variables. These are binary categoricals and I have tried them as both 0/1 integers and 2-level factors. One example would be drugtype. In this case the model takes the form:

event ~ exposure + drugtype + strata(indivL) + offset(log(interval))

My error is:

Warning message:
In coxph(formula = Surv(rep(1, 176241L), event) ~ exposure + drugtype +  :
  X matrix deemed to be singular; variable 2

My model is:

--SNIP--
coxph(formula = Surv(rep(1, 176241L), event) ~ exposure + drugtype + 
     strata(indivL) + offset(log(interval)), data = chopdat, method = "exact")

  n= 176049, number of events= 58602 
   (192 observations deleted due to missingness)

             coef exp(coef) se(coef)     z            Pr(>|z|)    
exposure1 0.70760   2.02912  0.01662 42.57 <0.0000000000000002 ***
drugtype       NA        NA  0.00000    NA                  NA    
--SNIP--

As you can see, it does not like drugtype, which is a binary variable.

Having looked around I've come across several sources which suggest the problem is a case of "perfect classification" i.e. one of my variables perfectly predicts the presence of another. However, using xtabs() I get:

> xtabs(~drugtype + event, data = chopdat)

         event
 drugtype      0      1
        0 778306 388279
        1  29344  14625

and

> xtabs(~ exposure + event, data = chopdat)

        event
exposure      0      1
       0 427482 380101
       1 380788  23113

and

> xtabs(~ drugtype + exposure, data = chopdat)

       drugtype
exposure      0      1
       0 777655  29308
       1 388930  14661

Suggesting there is a good distribution and no perfect classification.

Can anyone point me in the right direction for some more information on this? I feel I've reached the limits of what I'm able to do with the documentation and searching for other answers to this question on StackOverflow.

Many thanks.


Solution

  • Right, hello all.

    Many thanks to @Mike for trying to help me. I have found the explanation for this behaviour and it is a quirk of the SCCS method of modelling rather than to do with clogit itself.

    From Farrington, Whitaker and Weldeselassie Self-Controlled Case Series Studies, A Modelling Guide with R:

    Note that the main effect of the covariate is not included in the model formula, since it cannot be estimated in a SCCS model as it drops out of the likelihood.

    So, the main effect cannot be estimate and is thus returned as NA.

    Apologies for posting a question to which I should have been able to find an answer but hopefully this will be of use to anyone else who runs across this "problem".