Tags: r, correlation, confidence-interval, robust, credible-interval

Extract credible intervals for robust correlations in R


I currently use pbcor() from the WRS2 package to estimate robust correlations. This function also calculates a 95% bootstrap confidence interval around the estimated robust correlation. For consistency with the rest of my analyses and manuscript, I need to report credible intervals instead of confidence intervals.

How can I extract the 95% credible intervals, instead of the 95% confidence intervals? Is there a way to do this using pbcor?

My dataset contains 210 observations, but here is a subset of the data:

Individual  varA    varB
1   2.9380842   0.09896456
2   2.9380842   -1.38772037
3   -0.6879859  -2.41310243
4   -0.6879859  0.55722346
5   -2.3129564  -1.34140699
6   -2.3129564  -1.75604301
7   -0.4937431  0.78381085
8   -0.4937431  0.38320385
9   -0.8558126  0.82125672
10  -0.8558126  0.06346062
11  -0.9211026  -1.67170174

Corresponding code:

WRS2::pbcor(data$varA, data$varB, ci=TRUE, nboot=1000, beta=0.1) 
Robust correlation coefficient: 0.275
Test statistic: 0.8582
p-value: 0.41307
Bootstrap CI: [-0.3564; 0.7792]

Solution

  • Hi @Blundering Ecologist

    Here is a complete example of estimating credible intervals with a Bayesian model, for comparison against the WRS2-based robust bootstrap confidence intervals. If you use the set.seed() call you should be able to recreate the data; your results for the Bayesian part will differ slightly from mine, as they should, since MCMC sampling is stochastic. My comments are included in the code below.

    > ## generate data
    > set.seed(123)     # for reproducibility
    > x <- seq(1:25)+rnorm(25)     
    > y <- seq(26:50)-7*rnorm(25)   # note: seq(26:50) evaluates to 1:25 (i.e. seq_along), not 26:50
    > y[10] <- y[10] * 2.5  # introduce outlier in 10th record
    > y[20] <- y[20] * 1.5 # introduce outlier in 20th record
    > 
    > simdat <- cbind(data.frame(x), data.frame(y)) # create data frame
    > 
    > 
    > ## standardize data
    > library(robustHD)      # very useful functions standardize() & robStandardize()
    Loading required package: ggplot2
    Loading required package: perry
    Loading required package: parallel
    Loading required package: robustbase
    > simdat$x_std <- standardize(simdat$x)     # mean and sd
    > simdat$x_std_rob <- robStandardize(simdat$x)  # median and MAD
    > 
    > ## repeat for y
    > simdat$y_std <- standardize(simdat$y)     # uses mean and sd
    > simdat$y_std_rob <- robStandardize(simdat$y)  # uses median and MAD
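    > ## for reference (sketch, not run in the original transcript):
    > ## robStandardize() is roughly equivalent to manual median/MAD scaling,
    > ## e.g. (simdat$x - median(simdat$x)) / mad(simdat$x)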
    > 
    > head(simdat) # to see variable names of the standardized data
              x         y      x_std  x_std_rob        y_std   y_std_rob
    1 0.4395244 12.806853 -1.7617645 -1.4269699  0.003689598  0.00000000
    2 1.7698225 -3.864509 -1.5746770 -1.2805106 -1.705238038 -1.39579772
    3 4.5587083  1.926388 -1.1824599 -0.9734679 -1.111631801 -0.91095903
    4 4.0705084 11.966959 -1.2511183 -1.0272163 -0.082405292 -0.07031957
    5 5.1292877 -3.776704 -1.1022161 -0.9106499 -1.696237444 -1.38844632
    6 7.7150650  3.014750 -0.7385634 -0.6259685 -1.000067292 -0.81983669
    > 
    > ## get uncorrected correlation
    > cor(simdat$x, simdat$y)
    [1] 0.7507123
    > 
    > ## get boot-strapped correlation that corrects for the 2 outliers
    > library(WRS2)
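    > # beta = 0.1 below is the percentage-bend bending constant: roughly the
    > # proportion of extreme observations down-weighted in each margin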
    > corrxy <- WRS2::pbcor(simdat$y, simdat$x, ci=TRUE, nboot=2000, beta=0.1)
    > corrxy
    Call:
    WRS2::pbcor(x = simdat$y, y = simdat$x, beta = 0.1, ci = TRUE, 
        nboot = 2000)
    
    Robust correlation coefficient: 0.7657
    Test statistic: 5.7084
    p-value: 1e-05 
    
    Bootstrap CI: [0.5113; 0.9116]  # Boot-strapped CI
    
    > ## set up bivariate Bayesian regression without intercept
    > ## so we get the pure zero-order correlation
    > library(brms)
    Loading required package: Rcpp
    Loading 'brms' package (version 2.13.5). Useful instructions
    can be found by typing help('brms'). A more detailed introduction
    to the package is available through vignette('brms_overview').
    
    Attaching package: ‘brms’
    
    The following object is masked from ‘package:robustbase’:
    
        epilepsy
    
    The following object is masked from ‘package:stats’:
    
        ar
    
    > library(shinystan) 
    > # gives a lovely visualization of the brms model fit object
    Loading required package: shiny
    
    This is shinystan version 2.5.0
    
    > # in the formula below "y ~ 0 + x_std", 0 ensures there is no intercept
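    > # because both variables are standardized (mean 0, sd 1), the no-intercept
    > # slope is the zero-order Pearson correlation, so the posterior of b_x_std
    > # is a posterior for the correlation itself; as a sanity check (not run here),
    > # coef(lm(y_std ~ 0 + x_std, data = simdat)) reproduces cor(simdat$x, simdat$y)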
    > mod1 <- brm( y_std ~ 0 + x_std, data=simdat, cores=2, chains=2)
    Compiling Stan program...
    Start sampling
    
    [Stan chain progress output trimmed: 2 chains, 2000 iterations each (1000 warmup + 1000 sampling); each chain finished in about 0.06 seconds]
    > summary(mod1)
     Family: gaussian 
      Links: mu = identity; sigma = identity 
    Formula: y_std ~ 0 + x_std 
       Data: simdat (Number of observations: 25) 
    Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
             total post-warmup samples = 2000
    
    Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
    x_std     0.76      0.14     0.48     1.05 1.00     1187     1030
    
    # Bootstrap CI (corrects for the outliers): 0.51 to 0.91
    # Bayesian 95% credible interval (does not correct for the outliers): 0.48 to 1.05
    # Since the bootstrap CI lies within the Bayesian credible interval,
    # I would use the credible interval.
    # Raw corr: 0.75 vs Bayesian corr: 0.76 vs bootstrap corr: 0.77
    
    Family Specific Parameters: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
    sigma     0.69      0.11     0.52     0.95 1.00     1345     1132
    
    Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
    and Tail_ESS are effective sample size measures, and Rhat is the potential
    scale reduction factor on split chains (at convergence, Rhat = 1).
    > # extract posterior samples of population-level effects 
    
    > samples1 <- posterior_samples(mod1, "^b") # data frame of posterior draws of the slope, i.e. the correlation
    > head(samples1)
        b_x_std
    1 0.9093316
    2 0.7281373
    3 0.7207291
    4 0.6822180
    5 0.9747108
    6 0.9653564
    
    > samples2 <- posterior_samples(mod1, "sigma") # data frame of posterior draws of the residual standard deviation (sigma)
    > head(samples2)
          sigma
    1 0.7320897
    2 0.7212673
    3 0.6204091
    4 0.7844105
    5 0.9443782
    6 0.7311916
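
    The 95% credible interval the question asks for can be read off summary(mod1) above (the l-95% CI / u-95% CI columns), or computed directly from the posterior draws just extracted; a minimal sketch (not part of the original run):

    ## 2.5% and 97.5% posterior quantiles of the slope = 95% credible interval
    quantile(samples1$b_x_std, probs = c(0.025, 0.975))

    ## the same bounds are also available from the fitted model object
    fixef(mod1)                            # Q2.5 / Q97.5 columns
    posterior_interval(mod1, prob = 0.95)  # 95% intervals for all parameters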
    
    > launch_shinystan(mod1) # launches in your web browser
    
    > write.csv(samples1,"/home/Documents/Projects/Rcode/rob_corr_brms.csv", row.names = FALSE) # to do more using Excel
    > write.csv(samples2,"/home/Documents/Projects/Rcode/rob_corr_var_brms.csv", row.names = FALSE) # to do more using Excel  
    
    > # To learn more about brms see this link below
    
    http://paul-buerkner.github.io/brms/articles/index.html
    
    Here is the second model, run with the robust-standardized x and y:
    
    > mod_rob <- brm( y_std_rob ~ 0 + x_std_rob, data=simdat, cores=2, chains=2) 
    Compiling Stan program...
    Start sampling
    
    [Stan chain progress output trimmed: 2 chains, 2000 iterations each (1000 warmup + 1000 sampling); each chain finished in about 0.05 seconds]
    > summary(mod_rob)
     Family: gaussian 
      Links: mu = identity; sigma = identity 
    Formula: y_std_rob ~ 0 + x_std_rob 
       Data: simdat (Number of observations: 25) 
    Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
             total post-warmup samples = 2000
    
    Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
    x_std_rob     0.77      0.14     0.50     1.06 1.00     1639     1201
    
    Family Specific Parameters: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
    sigma     0.57      0.08     0.43     0.76 1.00     1314      977
    
    Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
    and Tail_ESS are effective sample size measures, and Rhat is the potential
    scale reduction factor on split chains (at convergence, Rhat = 1).
    
    > samples_rob <- posterior_samples(mod_rob, "^b")
    > head(samples_rob)
      b_x_std_rob
    1   0.8917219
    2   0.6036900
    3   0.9898435
    4   0.6937937
    5   0.7883487
    6   0.8781157
    > samples_rob_var <- posterior_samples(mod_rob, "sigma")
    > head(samples_rob_var)
          sigma
    1 0.5646454
    2 0.4547035
    3 0.6541133
    4 0.4691680
    5 0.6478816
    6 0.4777489
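
    As with the first model, the 95% credible interval for the robust-standardized correlation can be pulled straight from these draws; a short sketch (not part of the original run):

    ## 95% credible interval of the robust-standardized slope
    quantile(samples_rob$b_x_std_rob, probs = c(0.025, 0.975))

    ## or directly from the model: the Q2.5 / Q97.5 columns are the 95% credible bounds
    fixef(mod_rob)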