I am trying to carry out a self-controlled case series using the package sccs
from here. This is a statistical method which takes "baseline" and "exposed" periods in the timecourse of e.g. a year of a patient's life. The exposed periods could represent exposure to a drug and the outcome measured could be a side-effect of the drug as, indeed, it is in my case.
This package essentially formats the data into intervals of baseline and exposure risk. A patient identifier indivL
(factor), interval
(integer, number of days), exposure
status (0/1), event
status (0/1). It then feeds this data into survival::clogit
as a model of the form:
event ~ exposure + strata(indivL) + offset(log(interval))
The data fed to clogit
is a data frame of the form:
indivL event eventday lower upper interval age exposure indiv aevent astart aend drugtype
* <fct> <dbl> <int> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0. 22361 22219. 22252. 34. 1 0 1. 22361. 22219. 22460. 0.
2 1 0. 22361 22253. 22260. 8. 1 1 1. 22361. 22219. 22460. 0.
3 1 1. 22361 22261. 22460. 200. 1 0 1. 22361. 22219. 22460. 0.
4 2 0. 22401 22219. 22252. 34. 1 0 1. 22401. 22219. 22460. 0.
5 2 0. 22401 22253. 22260. 8. 1 1 1. 22401. 22219. 22460. 0.
6 2 1. 22401 22261. 22460. 200. 1 0 1. 22401. 22219. 22460. 0.
7 3 0. 31071 30834. 30863. 30. 1 0 2. 31071. 30834. 31075. 0.
8 3 0. 31071 30864. 30871. 8. 1 1 2. 31071. 30834. 31075. 0.
9 3 1. 31071 30872. 31075. 204. 1 0 2. 31071. 30834. 31075. 0.
10 4 1. 261 207. 356. 150. 1 0 3. 261. 207. 425. 0.
# ... with 1,211,460 more rows
I have my model working well for giving me a result when using the above. However, I want to add in other independent variables. These are binary categoricals and I have tried them as both 0/1 integers and 2-level factors. One example would be drugtype
. In this case the model takes the form:
event ~ exposure + drugtype + strata(indivL) + offset(log(interval))
My error is:
Warning message:
In coxph(formula = Surv(rep(1, 176241L), event) ~ exposure + drugtype + :
X matrix deemed to be singular; variable 2
My model is:
--SNIP--
coxph(formula = Surv(rep(1, 176241L), event) ~ exposure + drugtype +
strata(indivL) + offset(log(interval)), data = chopdat, method = "exact")
n= 176049, number of events= 58602
(192 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
exposure1 0.70760 2.02912 0.01662 42.57 <0.0000000000000002 ***
drugtype NA NA 0.00000 NA NA
--SNIP--
As you can see, it does not like drugtype
, which is a binary variable.
Having looked around I've come across several sources which suggest the problem is a case of "perfect classification" i.e. one of my variables perfectly predicts the presence of another. However, using xtabs()
I get:
> xtabs(~drugtype + event, data = chopdat)
event
drugtype 0 1
0 778306 388279
1 29344 14625
and
> xtabs(~ exposure + event, data = chopdat)
event
exposure 0 1
0 427482 380101
1 380788 23113
and
> xtabs(~ drugtype + exposure, data = chopdat)
drugtype
exposure 0 1
0 777655 29308
1 388930 14661
Suggesting there is a good distribution and no perfect classification.
Can anyone point me in the right direction for some more information on this? I feel I've reached the limits of what I'm able to do with the documentation and searching for other answers to this question on StackOverflow.
Many thanks.
Right, hello all.
Many thanks to @Mike for trying to help me. I have found the explanation for this behaviour and it is a quirk of the SCCS method of modelling rather than to do with clogit
itself.
From Farrington, Whitaker and Weldeselassie Self-Controlled Case Series Studies, A Modelling Guide with R:
Note that the main effect of the covariate is not included in the model formula, since it cannot be estimated in a SCCS model as it drops out of the likelihood.
So, the main effect cannot be estimate and is thus returned as NA
.
Apologies for posting a question to which I should have been able to find an answer but hopefully this will be of use to anyone else who runs across this "problem".