Tags: r, dataframe, survival-analysis, cox-regression

How to structure data for survival analysis with multiple or recurring events in R?


Background

I have a project in which I want to compare the amount of time to a health event Y between a group of people exposed to a prescription drug treatment X and an otherwise similar but treatment-unexposed group (X=1 and X=0, respectively), all while controlling for a set of covariates C.

The data I have to answer this question are health insurance claims (public insurance) in a certain geographical area during a 7-year interval of time.

  • Each row in the dataset represents a claim.
  • The event Y is coded as a diagnostic code, while treatment X is coded with a sort of drug serial number used by my country's regulator to identify prescribed drugs.
  • Individual patients are coded with unique ID numbers.
  • Covariates C appear in every row in which a patient has a claim. So for instance, their geographical area geo, a 3-digit code, will appear in every claim that a certain ID has.

Because patients are able to experience multiple events and can appear in the database more than once (with or without eligible exposures or events), I've chosen as my analytic method a Cox model with shared frailty on ID to account for within-subject (within-ID) correlation. I'm not an expert in survival analysis, though, and I've never done any work with random effects. I'll be analyzing the data in R, using its survival package and methods.

The Problem

I'm not sure how to organize my data to identify the exposed and those with an event. Specifically, do I mark someone as X=1 only on the claim in which their exposure appears, or on every one of their claims? (The same applies to outcomes: is someone with an eligible outcome marked Y=1 in all their claims, or just in the claim in which the relevant diagnostic code appears?)

To illustrate, here's a dummy dataset representing the first way to go. Let's call this Option 1:

  ID   | claim_date  | geo  | drug  | exposed | diagnostic_cd |  event  |
-------+-------------+------+-------+---------+---------------+---------+
 001   |  2011-01-30 | 123  |       |    0    |      LZ13     |    1    |
 001   |  2012-04-12 | 123  | D57   |    1    |      SS24     |    0    |
 001   |  2014-06-27 | 123  | A60   |    0    |               |    0    |
 002   |  2017-09-03 | 456  | D57   |    1    |      MN45     |    1    |
 002   |  2018-12-25 | 456  | C08   |    1    |      MN45     |    1    |

Here, the exposed are marked as such in the indicator variable exposed, and they're only marked in claims in which they have an "exposure event" -- an eligible drug. In the case of ID 001, that's D57, and for ID 002 that's D57 and C08. The same holds for the indicator event, which equals 1 when there's an eligible diagnostic_cd (LZ13 for ID 001, and MN45 for ID 002.)
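
For concreteness, Option 1 could be derived in R roughly as follows. This is only a sketch: the vectors `eligible_drugs` and `eligible_dx` are hypothetical lists assumed from the dummy table above, and empty strings stand in for missing codes.

```r
# Dummy claims copied from the Option 1 table; "" means no code on that claim.
claims <- data.frame(
  ID            = c("001", "001", "001", "002", "002"),
  claim_date    = as.Date(c("2011-01-30", "2012-04-12", "2014-06-27",
                            "2017-09-03", "2018-12-25")),
  geo           = c("123", "123", "123", "456", "456"),
  drug          = c("", "D57", "A60", "D57", "C08"),
  diagnostic_cd = c("LZ13", "SS24", "", "MN45", "MN45"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: eligibility lists inferred from the example, not real code sets.
eligible_drugs <- c("D57", "C08")
eligible_dx    <- c("LZ13", "MN45")

# Option 1: flag only the claim on which the eligible code actually appears.
claims$exposed <- as.integer(claims$drug %in% eligible_drugs)
claims$event   <- as.integer(claims$diagnostic_cd %in% eligible_dx)

print(claims)
```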

By contrast, here's the other way to go -- Option 2:

  ID   | claim_date  | geo  | drug  | exposed | diagnostic_cd |  event  |
-------+-------------+------+-------+---------+---------------+---------+
 001   |  2011-01-30 | 123  |       |    1    |      LZ13     |    1    |
 001   |  2012-04-12 | 123  | D57   |    1    |      SS24     |    1    |
 001   |  2014-06-27 | 123  | A60   |    1    |               |    1    |
 002   |  2017-09-03 | 456  | D57   |    1    |      MN45     |    1    |
 002   |  2018-12-25 | 456  | C08   |    1    |      MN45     |    1    |

Here, having any eligible drug or diagnostic_cd will mark you exposed=1 or event=1 in every row in which your ID appears.
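
In R, Option 2 would amount to spreading each person's maximum over all of their rows. A minimal sketch, starting from row-level indicators like those in Option 1 (the column names `ever_exposed` and `ever_event` are made up for illustration):

```r
# Row-level indicators for the dummy claims (one row per claim).
claims <- data.frame(
  ID      = c("001", "001", "001", "002", "002"),
  exposed = c(0L, 1L, 0L, 1L, 1L),
  event   = c(1L, 0L, 0L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Option 2: every row of an ID gets 1 if any row of that ID has a 1.
# ave() applies max() within each ID and returns a vector aligned to the rows.
claims$ever_exposed <- ave(claims$exposed, claims$ID, FUN = max)
claims$ever_event   <- ave(claims$event,   claims$ID, FUN = max)

print(claims)
```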

My intuition is that Option 1 is the way to go, but I'm not really able to explain why, and I'm sitting here doubting myself.

Also, this may well be a question that's better suited for CrossValidated, so just let me know if that's the case and I'll post it over there.

Any thoughts?


Solution

  • The most common approach is to code the exposure (in this case, a binary exposure) as 1 or 0 and the event as 1 or 0, each on the row where it occurs. That makes Option 1 the more standard layout for a survival analysis dataset.

    There are many methods for analysing survival data with recurring events; a frailty model is just one of them. One problem with a frailty model is that it treats all of a person's events as equivalent (a first event is handled the same way as a later one), which might not always be appropriate. An excellent tutorial is the paper by Amorim and Cai in the International Journal of Epidemiology (https://academic.oup.com/ije/article/44/1/324/654595). The authors provide a sample dataset and code in R, Stata and SAS.

    What your data seem to be missing is a start date for each patient, that is, the point from which follow-up time is measured. This is essential for any survival analysis.
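
As a rough illustration of why the start date matters, the sketch below turns Option 1 style claims into (tstart, tstop] intervals, the counting-process layout that recurrent-event Cox models in the survival package expect. Several things here are assumptions and not part of the question: the `starts` table with its `start_date` values is invented, exposure is treated as switching on after the first eligible drug claim and staying on, and `exposed_tv` is a hypothetical column name.

```r
# Option 1 style claims: indicators set only on the claim where the code appears.
claims <- data.frame(
  ID         = c("001", "001", "001", "002", "002"),
  claim_date = as.Date(c("2011-01-30", "2012-04-12", "2014-06-27",
                         "2017-09-03", "2018-12-25")),
  exposed    = c(0L, 1L, 0L, 1L, 1L),
  event      = c(1L, 0L, 0L, 1L, 1L),
  stringsAsFactors = FALSE
)

# ASSUMPTION: invented start of follow-up per patient (e.g. enrolment date).
starts <- data.frame(
  ID         = c("001", "002"),
  start_date = as.Date(c("2011-01-01", "2017-01-01")),
  stringsAsFactors = FALSE
)

claims <- merge(claims, starts, by = "ID")
claims <- claims[order(claims$ID, claims$claim_date), ]

# Each interval ends at a claim, measured in days since the patient's start.
claims$tstop <- as.numeric(claims$claim_date - claims$start_date)

# Each interval begins where the previous one ended (0 for the first claim).
claims$tstart <- ave(claims$tstop, claims$ID,
                     FUN = function(t) c(0, head(t, -1)))

# ASSUMPTION: a patient counts as exposed during an interval only if an
# eligible drug appeared on an earlier claim, and stays exposed afterwards.
claims$exposed_tv <- ave(claims$exposed, claims$ID,
                         FUN = function(x) c(0L, head(cummax(x), -1)))

print(claims[, c("ID", "tstart", "tstop", "event", "exposed_tv")])

if (requireNamespace("survival", quietly = TRUE)) {
  # Counting-process response: one (tstart, tstop] interval per row.
  print(survival::Surv(claims$tstart, claims$tstop, claims$event))
  # With real data, a shared-frailty fit might look like this (not run here,
  # since five dummy rows are far too few to estimate anything):
  # library(survival)
  # coxph(Surv(tstart, tstop, event) ~ exposed_tv + frailty(ID), data = claims)
}
```

The covariates C would be added to the right-hand side of the formula. How exposure should carry forward in time (for example, only while a prescription is active) is a substantive decision that this sketch does not settle.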