Creating Count Process Data Set With Time-Varying Covariates in R

I am working through Chapter 15 of Applied Longitudinal Data-Analysis by Singer and Willett, on Extending the Cox Regression model, but the UCLA website here has no example R code for this chapter. I am trying to re-create the section on time-varying covariates and am stuck on how to create a count process dataset from the person-level dataframe provided. I have looked through the survival package vignette but am having trouble creating my own dataset. Here is toy data modelled on the Singer and Willett example of predictors of time to first cocaine use in men between 17 and 40 years of age. id is ID, ageInit is age of either first cocaine use or censoring, used is the event log, indicating first cocaine use (1) or censoring (0). There is one time invariant predictor earlyMJ, indicating marijuana use prior to 17 years of age. There are also three time-varying covariates that need to inform the creation of the dataset: timeMJ, which is age of first marijuana use after 17, sellMJ which is age of selling marijuana, and odFirst, which is age of first using some other drug. NAs in these predictors indicate the participant did not perform the action in question at any time.

set.seed(1356)
df <- data.frame(id = 1:6,
                 ageInit = c(25,34,40,29,27,40),
                 used = c(0,1,1,0,1,1), 
                 earlyMJ = c(0,0,1,0,1,1),
                 timeMJ = c(18,27,22,21,22,19),
                 sellMJ = c(NA,NA,25,NA,35,NA),
                 odFirst = c(19,22,35,NA,22,34))

Following Therneau et al.'s process from the survival vignette we create a second dataset, in this case re-expressing the outcome variable ageInit, and the time-varying predictors as number of years from study initiation (i.e. 17 years old).

tdata <- with(df, data.frame (id = id,
                              usedTime = ageInit - 17,
                              timeToFirstMJ = timeMJ - 17,
                              timeToSellMJ = sellMJ - 17,
                              timeToFirstOD = odFirst - 17,
                              usedCocaine = used))

The we merge this new data with the original dataset, creating, via the event() call, two new columns expressing time intervals for each of the covariates. We also create, via the tdc() call, a new row for each participant for each time they experienced one of the covariate events.

sdata <- tmerge(df, tdata, 
                id=id, 
                firstUse = event(futime, usedCocaine), 
                t1MJ   =  tdc(timeToFirstMJ),
                t1SMJ  =  tdc(timeToSellMJ),
                t1OD  =  tdc(timeToFirstOD),
                options= list(idname="subject"))
attr(sdata, "tcount")

The issue is that when I run the Cox regression, using both the time-invariant, and a few of the time-varying covariates, I cannot make the model work.

coxph(Surv(tstart, tstop, firstUse) ~ earlyMJ + t1SMJ + t1OD, data= sdata, ties="breslow")

and get nonsensical coefficient and the warning message

In fitter(X, Y, strats, offset, init, control, weights = weights,  :
  Loglik converged before variable  1,3 ; beta may be infinite.

Furthermore I do not even know if this is a proper count process dataframe, because in Singer and Willett they suggest that each participant needs to have a row for every time anyone in the dataset experiences the event.

Any guidance in these matters would be much appreciated.

Solution

A good start is the Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model vignette in the survival package.

The issue is that when I run the Cox regression, using both the time-invariant, and a few of the time-varying covariates, I cannot make the model work.

What you do seems correct at first glance but it might just be the fact that you only have 6 persons and three parameters to estimate?

get nonsensical coefficient and the warning message

The warning makes good sense. The standard error on two of your parameter estimates is greater than 1000. See below.

Furthermore do not even know if this is a proper count process dataframe, because in Singer and Willett they suggest that each participant needs to have a row for every time anyone in the dataset experiences the event.

This is handled internally in coxph.

Here is my code to reproduce your results

#####
# setup data
df <- data.frame(id = 1:6,
                 ageInit = c(25,34,40,29,27,40),
                 used =    c( 0, 1, 1, 0, 1, 1), 
                 earlyMJ = c( 0, 0, 1, 0, 1, 1),
                 timeMJ  = c(18,27,22,21,22,19),
                 sellMJ =  c(NA,NA,25,NA,35,NA),
                 odFirst = c(19,22,35,NA,22,34))
shift_cols <- c("ageInit", "timeMJ", "sellMJ", "odFirst")
df[shift_cols] <- lapply(df[shift_cols], "-", 17)

library(survival)
est_df <- df[, c("id", "ageInit", "earlyMJ", "used")]
est_df <- tmerge(
  est_df, est_df, id = id, start_using = event(ageInit, used))
est_df <- tmerge(
  est_df, df, id = id, 
  t1MJ =  tdc(timeMJ), t1SMJ = tdc(sellMJ), t1OD = tdc(odFirst))

#####
# fit model
fit <- coxph(
  Surv(tstart, tstop, start_using) ~ earlyMJ + t1SMJ + t1OD, data = est_df)
#R> Warning message:
#R> In fitter(X, Y, strats, offset, init, control, weights = weights,  :
#R>   Loglik converged before variable  1,3 ; beta may be infinite. 

summary(fit)
#R> Call:
#R> coxph(formula = Surv(tstart, tstop, start_using) ~ earlyMJ + 
#R>     t1SMJ + t1OD, data = est_df)
#R> 
#R>   n= 17, number of events= 4 
#R> 
#R>              coef exp(coef)  se(coef)     z Pr(>|z|)
#R> earlyMJ 2.047e+01 7.792e+08 2.791e+04 0.001    0.999
#R> t1SMJ   4.744e-16 1.000e+00 1.414e+00 0.000    1.000
#R> t1OD    4.157e+01 1.136e+18 3.883e+04 0.001    0.999
#R> 
#R>         exp(coef) exp(-coef) lower .95 upper .95
#R> earlyMJ 7.792e+08  1.283e-09   0.00000       Inf
#R> t1SMJ   1.000e+00  1.000e+00   0.06255     15.99
#R> t1OD    1.136e+18  8.806e-19   0.00000       Inf
#R> 
#R> Concordance= 1  (se = 0.258 )
#R> Rsquare= 0.273   (max possible= 0.33 )
#R> Likelihood ratio test= 5.42  on 3 df,   p=0.1437
#R> Wald test            = 0  on 3 df,   p=1
#R> Score (logrank) test = 4.14  on 3 df,   p=0.2463