Tags: r, correlation, confidence-interval, robust, credible-interval

Extract credible intervals for robust correlations in R


I currently use pbcor() from the WRS2 package to estimate robust correlations. This function also calculates a 95% bootstrap confidence interval around the estimated robust correlation. For consistency with the rest of my analyses and manuscript, I need to report credible intervals instead of confidence intervals.

How can I extract the 95% credible intervals, instead of the 95% confidence intervals? Is there a way to do this using pbcor?

My dataset contains 210 observations, but here is a subset of the data:

Individual  varA    varB
1   2.9380842   0.09896456
2   2.9380842   -1.38772037
3   -0.6879859  -2.41310243
4   -0.6879859  0.55722346
5   -2.3129564  -1.34140699
6   -2.3129564  -1.75604301
7   -0.4937431  0.78381085
8   -0.4937431  0.38320385
9   -0.8558126  0.82125672
10  -0.8558126  0.06346062
11  -0.9211026  -1.67170174

Corresponding code:

WRS2::pbcor(data$varA, data$varB, ci=TRUE, nboot=1000, beta=0.1) 
Robust correlation coefficient: 0.275
Test statistic: 0.8582
p-value: 0.41307
Bootstrap CI: [-0.3564; 0.7792]

Solution

  • Hi @Blundering Ecologist

    Here is a complete example of estimating credible intervals with a Bayesian model, for comparison against the WRS2-based robust bootstrap confidence intervals. If you use the set.seed() call you should be able to recreate the data; your results for the Bayesian part will differ slightly from mine, as they should, since MCMC sampling is stochastic. My comments are included in the code below.

    > ## generate data
    > set.seed(123)     # for reproducibility
    > x <- seq(1:25)+rnorm(25)     
    > y <- seq(26:50)-7*rnorm(25)   # note: seq(26:50) evaluates to 1:25 (i.e. seq_along), not 26:50
    > y[10] <- y[10] * 2.5  # introduce outlier in 10th record
    > y[20] <- y[20] * 1.5 # introduce outlier in 20th record
    > 
    > simdat <- cbind(data.frame(x), data.frame(y)) # create data frame
    > 
    > 
    > ## standardize data
    > library(robustHD)      # very useful functions standardize() & robStandardize()
    Loading required package: ggplot2
    Loading required package: perry
    Loading required package: parallel
    Loading required package: robustbase
    > simdat$x_std <- standardize(simdat$x)     # mean and sd
    > simdat$x_std_rob <- robStandardize(simdat$x)  # median and MAD
    > 
    > ## repeat for y
    > simdat$y_std <- standardize(simdat$y)     # uses mean and sd
    > simdat$y_std_rob <- robStandardize(simdat$y)  # uses median and MAD
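    > ## for reference (sketch, not run in the original transcript):
    > ## robStandardize() is roughly equivalent to manual median/MAD scaling,
    > ## e.g. (simdat$x - median(simdat$x)) / mad(simdat$x)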
    > 
    > head(simdat) # to see variable names of the standardized data
              x         y      x_std  x_std_rob        y_std   y_std_rob
    1 0.4395244 12.806853 -1.7617645 -1.4269699  0.003689598  0.00000000
    2 1.7698225 -3.864509 -1.5746770 -1.2805106 -1.705238038 -1.39579772
    3 4.5587083  1.926388 -1.1824599 -0.9734679 -1.111631801 -0.91095903
    4 4.0705084 11.966959 -1.2511183 -1.0272163 -0.082405292 -0.07031957
    5 5.1292877 -3.776704 -1.1022161 -0.9106499 -1.696237444 -1.38844632
    6 7.7150650  3.014750 -0.7385634 -0.6259685 -1.000067292 -0.81983669
    > 
    > ## get uncorrected correlation
    > cor(simdat$x, simdat$y)
    [1] 0.7507123
    > 
    > ## get boot-strapped correlation that corrects for the 2 outliers
    > library(WRS2)
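    > # beta = 0.1 below is the percentage-bend bending constant: roughly the
    > # proportion of extreme observations down-weighted in each margin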
    > corrxy <- WRS2::pbcor(simdat$y, simdat$x, ci=TRUE, nboot=2000, beta=0.1)
    > corrxy
    Call:
    WRS2::pbcor(x = simdat$y, y = simdat$x, beta = 0.1, ci = TRUE, 
        nboot = 2000)
    
    Robust correlation coefficient: 0.7657
    Test statistic: 5.7084
    p-value: 1e-05 
    
    Bootstrap CI: [0.5113; 0.9116]  # Boot-strapped CI
    
    > ## set up bivariate Bayesian regression without intercept
    > ## so we get the pure zero-order correlation
    > library(brms)
    Loading required package: Rcpp
    Loading 'brms' package (version 2.13.5). Useful instructions
    can be found by typing help('brms'). A more detailed introduction
    to the package is available through vignette('brms_overview').
    
    Attaching package: ‘brms’
    
    The following object is masked from ‘package:robustbase’:
    
        epilepsy
    
    The following object is masked from ‘package:stats’:
    
        ar
    
    > library(shinystan) 
    > # gives a lovely visualization of the brms model fit object
    Loading required package: shiny
    
    This is shinystan version 2.5.0
    
    > # in the formula below "y ~ 0 + x_std", 0 ensures there is no intercept
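    > # because both variables are standardized (mean 0, sd 1), the no-intercept
    > # slope is the zero-order Pearson correlation, so the posterior of b_x_std
    > # is a posterior for the correlation itself; as a sanity check (not run here),
    > # coef(lm(y_std ~ 0 + x_std, data = simdat)) reproduces cor(simdat$x, simdat$y)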
    > mod1 <- brm( y_std ~ 0 + x_std, data=simdat, cores=2, chains=2)
    Compiling Stan program...
    Start sampling
    
    [Stan chain progress output trimmed: 2 chains, 2000 iterations each (1000 warmup + 1000 sampling); each chain finished in about 0.06 seconds]
    > summary(mod1)
     Family: gaussian 
      Links: mu = identity; sigma = identity 
    Formula: y_std ~ 0 + x_std 
       Data: simdat (Number of observations: 25) 
    Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
             total post-warmup samples = 2000
    
    Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
    x_std     0.76      0.14     0.48     1.05 1.00     1187     1030
    
    # Bootstrap CI (corrects for the outliers): 0.51 to 0.91
    # Bayesian 95% credible interval (does not correct for the outliers): 0.48 to 1.05
    # Since the bootstrap CI lies within the Bayesian credible interval,
    # I would use the credible interval.
    # Raw corr: 0.75 vs Bayesian corr: 0.76 vs bootstrap corr: 0.77
    
    Family Specific Parameters: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
    sigma     0.69      0.11     0.52     0.95 1.00     1345     1132
    
    Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
    and Tail_ESS are effective sample size measures, and Rhat is the potential
    scale reduction factor on split chains (at convergence, Rhat = 1).
    > # extract posterior samples of population-level effects 
    
    > samples1 <- posterior_samples(mod1, "^b") # data frame of posterior draws of the slope, i.e. the correlation
    > head(samples1)
        b_x_std
    1 0.9093316
    2 0.7281373
    3 0.7207291
    4 0.6822180
    5 0.9747108
    6 0.9653564
    
    > samples2 <- posterior_samples(mod1, "sigma") # data frame of posterior draws of the residual standard deviation (sigma)
    > head(samples2)
          sigma
    1 0.7320897
    2 0.7212673
    3 0.6204091
    4 0.7844105
    5 0.9443782
    6 0.7311916
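
    The 95% credible interval the question asks for can be read off summary(mod1) above (the l-95% CI / u-95% CI columns), or computed directly from the posterior draws just extracted; a minimal sketch (not part of the original run):

    ## 2.5% and 97.5% posterior quantiles of the slope = 95% credible interval
    quantile(samples1$b_x_std, probs = c(0.025, 0.975))

    ## the same bounds are also available from the fitted model object
    fixef(mod1)                            # Q2.5 / Q97.5 columns
    posterior_interval(mod1, prob = 0.95)  # 95% intervals for all parameters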
    
    > launch_shinystan(mod1) # launches in your web browser
    
    > write.csv(samples1,"/home/Documents/Projects/Rcode/rob_corr_brms.csv", row.names = FALSE) # to do more using Excel
    > write.csv(samples2,"/home/Documents/Projects/Rcode/rob_corr_var_brms.csv", row.names = FALSE) # to do more using Excel  
    
    > # To learn more about brms see this link below
    
    http://paul-buerkner.github.io/brms/articles/index.html
    
    Here is the second model, run with the robust-standardized x and y:
    
    > mod_rob <- brm( y_std_rob ~ 0 + x_std_rob, data=simdat, cores=2, chains=2) 
    Compiling Stan program...
    Start sampling
    
    [Stan chain progress output trimmed: 2 chains, 2000 iterations each (1000 warmup + 1000 sampling); each chain finished in about 0.05 seconds]
    > summary(mod_rob)
     Family: gaussian 
      Links: mu = identity; sigma = identity 
    Formula: y_std_rob ~ 0 + x_std_rob 
       Data: simdat (Number of observations: 25) 
    Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
             total post-warmup samples = 2000
    
    Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
    x_std_rob     0.77      0.14     0.50     1.06 1.00     1639     1201
    
    Family Specific Parameters: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
    sigma     0.57      0.08     0.43     0.76 1.00     1314      977
    
    Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
    and Tail_ESS are effective sample size measures, and Rhat is the potential
    scale reduction factor on split chains (at convergence, Rhat = 1).
    
    > samples_rob <- posterior_samples(mod_rob, "^b")
    > head(samples_rob)
      b_x_std_rob
    1   0.8917219
    2   0.6036900
    3   0.9898435
    4   0.6937937
    5   0.7883487
    6   0.8781157
    > samples_rob_var <- posterior_samples(mod_rob, "sigma")
    > head(samples_rob_var)
          sigma
    1 0.5646454
    2 0.4547035
    3 0.6541133
    4 0.4691680
    5 0.6478816
    6 0.4777489
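
    As with the first model, the 95% credible interval for the robust-standardized correlation can be pulled straight from these draws; a short sketch (not part of the original run):

    ## 95% credible interval of the robust-standardized slope
    quantile(samples_rob$b_x_std_rob, probs = c(0.025, 0.975))

    ## or directly from the model: the Q2.5 / Q97.5 columns are the 95% credible bounds
    fixef(mod_rob)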