I am attempting to reproduce some survival analysis results published in a journal. The original results were produced in Stata. Here is the code:
stset time, id(leadid) failure(c_coup)
streg legislature leg_growth_2 gdp_1k chgdpen_fearonlaitin Oil_fearonlaitin postcoldwarlag civiliandictatorshiplag militarydictatorshiplag communist lpopl1_fearonlaitin ethfrac_fearonlaitin relfrac_fearonlaitin age, distribution(weibull) time
Here is my attempt to translate this to R
code:
# load packages
library(survival)
library(survminer)
# load data
data <- read.csv("leader_tvc_2.csv", encoding = "latin1")
# create survival object
surv_object <- Surv(time, c_coup) ~ legislature + leg_growth_2 + gdp_1k + chgdpen_fearonlaitin + Oil_fearonlaitin + postcoldwarlag + civiliandictatorshiplag + militarydictatorshiplag + communist + lpopl1_fearonlaitin + ethfrac_fearonlaitin + relfrac_fearonlaitin + age
# run survreg
fit <- survreg(surv_object, data = data, dist = "weibull")
# examine output
summary(fit)
I am unable to obtain the same results. For example, the estimate for legislature
in the Stata results (the ones I want to produce) is 2.057. In R
, however, it is 2.48223. Any advice would be appreciated. Here is a link to the full data.
Here is the R output:
Call:
survreg(formula = surv_object, data = data, dist = "weibull")
Value Std. Error z p
(Intercept) 3.69422 0.62813 5.88 4.1e-09
legislature 2.48223 0.21580 11.50 < 2e-16
leg_growth_2 4.06103 1.81609 2.24 0.0253
gdp_1k -0.02217 0.03678 -0.60 0.5467
chgdpen_fearonlaitin 0.49579 1.11792 0.44 0.6574
Oil_fearonlaitin 0.24194 0.23799 1.02 0.3094
postcoldwarlag 0.55797 0.31626 1.76 0.0777
civiliandictatorshiplag -1.87331 0.32142 -5.83 5.6e-09
militarydictatorshiplag -1.34963 0.29635 -4.55 5.3e-06
communist 0.88426 0.28211 3.13 0.0017
lpopl1_fearonlaitin -0.00252 0.06636 -0.04 0.9697
ethfrac_fearonlaitin 0.47922 0.28625 1.67 0.0941
relfrac_fearonlaitin 0.67571 0.37836 1.79 0.0741
age 0.01261 0.00704 1.79 0.0733
Log(scale) -0.12430 0.06339 -1.96 0.0499
Scale= 0.883
Weibull distribution
Loglik(model)= -829 Loglik(intercept only)= -978
Chisq= 297.98 on 13 degrees of freedom, p= 6.4e-56
Number of Newton-Raphson Iterations: 18
n=3774 (5911 observations deleted due to missingness)
It is all about data setup for survival analysis. There are two issues:
survreg
treats each observation as one independent sampling unit while your Stata stset time, id(leadid) failure(c_coup)
specifies ID (leadid
) with multiple observations. You need to set up a start-stop type Surv object.survreg
would throw a error message: "start-stop type Surv objects are not supported". Use flexsurv::flexsurvreg
instead:To get the same results as Stata in R, try this:
library(survival)
library(survminer)
library(tidyverse)
library(flexsurv)
data <- read.csv("leader_tvc_2.csv", encoding = "latin1")
df <- data %>%
mutate(time0 = lag(time, default=0), .by = leadid)
surv_object <- Surv(time0, time, c_coup) ~ legislature + leg_growth_2 + gdp_1k + chgdpen_fearonlaitin + Oil_fearonlaitin + postcoldwarlag + civiliandictatorshiplag + militarydictatorshiplag + communist + lpopl1_fearonlaitin + ethfrac_fearonlaitin + relfrac_fearonlaitin + age
fit <- flexsurvreg(surv_object, data=df, dist = "weibull")
> broom::tidy(fit)
# A tibble: 15 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 shape 1.09 0.0697 NA NA
2 scale 13.1 8.61 NA NA
3 legislature 2.06 0.197 10.4 1.01e-25
4 leg_growth_2 3.04 1.81 1.68 4.68e- 2
5 gdp_1k 0.0145 0.0382 0.380 3.52e- 1
6 chgdpen_fearonlaitin 0.804 1.12 0.721 2.35e- 1
7 Oil_fearonlaitin 0.126 0.250 0.505 3.07e- 1
8 postcoldwarlag 0.517 0.329 1.57 5.84e- 2
9 civiliandictatorshiplag -1.28 0.325 -3.95 3.91e- 5
10 militarydictatorshiplag -0.882 0.306 -2.88 1.96e- 3
11 communist 0.480 0.282 1.71 4.40e- 2
12 lpopl1_fearonlaitin 0.0857 0.0664 1.29 9.87e- 2
13 ethfrac_fearonlaitin 0.419 0.298 1.41 7.97e- 2
14 relfrac_fearonlaitin 0.491 0.387 1.27 1.02e- 1
15 age -0.0156 0.00780 -2.00 2.27e- 2
>