Power analysis for Receiver Operating Characteristic (ROC) curves in R


I need to perform an a priori power analysis to determine the minimum sample size for the study we have designed.

In this study, we will utilize Receiver Operating Characteristic (ROC) curves to assess the effectiveness of three distinct biomarkers (A, B, C) in differentiating the stages of a given disease: normal (without the disease), mild stage, and severe stage.

In addition, these curves will be analyzed separately for participants who carry a specific protein gene (carriers) and those without this risk factor (non-carriers). It is important to note that the biomarkers (A, B, C) will be measured as continuous variables.

I would greatly appreciate it if you could look over the R code I implemented for this power analysis.

My main concern is that, with the implemented function, using 50 or even 5 participants per group yields high power (> 0.80). I would have to select very low effect sizes (say, 0.0, 0.1, and 0.2) to obtain low power in my analysis. In my field of research, the values of 0.00 (baseline), 0.2 (mild), and 0.40 (severe) are deemed a conservative approach.

Thank you for your assistance!

library(pROC)

set.seed(123)

# Define parameters
n_sim <- 1000            # Number of simulations
n <- 50                  # Small sample size per group for testing
mu <- c(0.00, 0.2, 0.40) # Effect sizes for Normal, Mild, Severe / We set an effect size of zero for Normal because it works as the reference group (baseline)
sigma <- 0.1             # Standard deviation (assuming equal variances)
alpha <- 0.05            # Significance level

# Storage for results
power_results <- data.frame(protein_status = rep(c("carrier", "non-carrier"), each = 3),
                            comparison = rep(c("Normal vs Mild", "Normal vs Severe", "Mild vs Severe"), 2),
                            biomarker = rep(c("A", "B", "C"), each = 6),
                            power = NA)

# Loop over protein status, biomarkers, and pairwise comparisons
# (note: the simulated data do not differ by protein status or biomarker,
# so all 18 cells share the same data-generating process)
for (protein in c("carrier", "non-carrier")) {
  for (biomarker in c("A", "B", "C")) {
    for (comparison in c("Normal vs Mild", "Normal vs Severe", "Mild vs Severe")) {
      
      sig_count <- 0
      
      # Run simulations
      for (i in 1:n_sim) {
        
        # Simulate data for each group
        data_normal <- rnorm(n, mean = mu[1], sd = sigma)
        data_mild <- rnorm(n, mean = mu[2], sd = sigma)
        data_severe <- rnorm(n, mean = mu[3], sd = sigma)
        
        # Select the two groups being compared; factor levels are listed
        # as c(controls, cases) so the roc() call below is unambiguous
        if (comparison == "Normal vs Mild") {
          data <- data.frame(
            value = c(data_normal, data_mild),
            group = factor(rep(c("normal", "mild"), each = n),
                           levels = c("normal", "mild"))
          )
        } else if (comparison == "Normal vs Severe") {
          data <- data.frame(
            value = c(data_normal, data_severe),
            group = factor(rep(c("normal", "severe"), each = n),
                           levels = c("normal", "severe"))
          )
        } else if (comparison == "Mild vs Severe") {
          data <- data.frame(
            value = c(data_mild, data_severe),
            group = factor(rep(c("mild", "severe"), each = n),
                           levels = c("mild", "severe"))
          )
        }
        
        # Compute the ROC curve; levels = c(controls, cases) and
        # direction = "<" (controls take lower values) are set explicitly
        # so that pROC's default direction = "auto" cannot flip the curve
        # and inflate the AUC
        roc_res <- suppressMessages(roc(data$group, data$value,
                                        levels = levels(data$group),
                                        direction = "<"))
        
        # 95% confidence interval of the AUC (DeLong method by default)
        ci <- ci.auc(roc_res)
        
        # Debugging: Print AUC and confidence interval for the first few simulations
        if (i <= 10) {
          cat("Sim:", i, "AUC:", auc(roc_res), "CI:", ci, "\n")
        }
        
        # Check if the lower bound of the confidence interval is greater than 0.5
        if (ci[1] > 0.5) {
          sig_count <- sig_count + 1
        }
      }
      
      # Calculate power
      power <- sig_count / n_sim
      
      # Store results
      power_results[power_results$protein_status == protein &
                      power_results$biomarker == biomarker &
                      power_results$comparison == comparison, "power"] <- power
    }
  }
}

# Print results
print(power_results)

[Screenshot of the printed power_results output showing high power (> 0.80) for every comparison.]


Solution

  • You have high power at small effect sizes and small sample sizes because the variability in the outcome NOT explained by your model is very small. This (essentially tautological) explanation of the source of high power is intended to draw your attention to the variances assumed in your simulation. Further detail below.

    Outcome Variability

    The standard deviation of your outcome (assumed to be 0.1 in all of the simulations) is small relative to the effect sizes you are considering. Your baseline/reference group has its data drawn from Normal(mean = 0, sd = 0.1). A property of the normal distribution is that about 95% of its values fall within 2 standard deviations of the mean, so about 95% of the values will be in the interval [-0.2, 0.2]. As only 5% of observations are expected to fall outside of that interval and your smallest effect size is 0.2, you're assuming very little overlap between the distributions of outcomes across groups. Accordingly, you'll have a lot of power even with small sample sizes! I illustrate this below.

    library(ggplot2)
    
    
    # Simulation parameters
    set.seed(123)
    n <- 10000
    ref_mean <- 0
    ref_sd <- 0.1
    effect <- 0.2
    effect_sd <- 0
    
    # Simulate outcomes
    ref <- data.frame(group = "ref", outcome = rnorm(n = n, mean = ref_mean, sd = ref_sd)) 
    alt <- data.frame(group = "alt", outcome = rnorm(n = n, mean = ref_mean + rnorm(n = n, mean = effect, sd = effect_sd), sd = ref_sd))
    df <- rbind(ref, alt)                  
    
    
    # Plot distributions of outcomes
    ggplot(df, aes(outcome, fill = group)) + 
      geom_density(alpha = 0.2) + 
      geom_vline(xintercept = 0, linetype = "dotted") +
      geom_vline(xintercept = ref_mean + effect, linetype = "dashed")
    
    This is the same logic behind employing blocking and covariate adjustment in randomized controlled trials: reducing the unexplained variability in the outcome improves the precision of estimates, which is essentially why you have high power.
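
    To put a number on the overlap: under the equal-variance normal model your simulation assumes, the population AUC is pnorm(delta / (sigma * sqrt(2))), where delta is the difference in group means. A quick sketch using your parameters (sigma_alt is a hypothetical alternative value, not from your code):

    # Implied population AUC under the equal-variance normal model:
    # AUC = pnorm(delta / (sigma * sqrt(2)))
    delta <- 0.2    # smallest effect size in your simulation
    sigma <- 0.1    # outcome SD assumed in your simulation

    pnorm(delta / (sigma * sqrt(2)))      # ~0.92: near-complete separation

    # With a larger (hypothetical) outcome SD, the implied AUC, and
    # therefore the power, drops sharply
    sigma_alt <- 1
    pnorm(delta / (sigma_alt * sqrt(2)))  # ~0.56: barely above chance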
    

    Effect Variability

    You are assuming that the "effect" is constant within a group, which is often inappropriate and reduces variance. For example, Group A may differ from Group B for some reason (say, B has a disease and A doesn't, but the groups are otherwise identical) that causes their outcomes to differ. You're assuming that the effect of the disease on the outcome is identical for every person in Group B - e.g., every person in B's blood pressure rises by exactly the same amount. In reality, there tends to be a distribution of effects (perhaps with a mean effect of 0.2!), as effects vary from person to person based on unmodeled characteristics. When effects aren't variable, the mean effect will be estimated more precisely.
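
    A minimal sketch of what adding effect heterogeneity to your simulation could look like; tau, the SD of the person-to-person effects, is a hypothetical value chosen for illustration:

    set.seed(123)
    n     <- 50
    sigma <- 0.1   # residual SD, as in your simulation
    tau   <- 0.1   # hypothetical SD of the individual effects

    # Constant effect (your current assumption): everyone shifts by exactly 0.2
    mild_constant <- rnorm(n, mean = 0.2, sd = sigma)

    # Variable effect: each subject draws their own effect from N(0.2, tau)
    subject_effect <- rnorm(n, mean = 0.2, sd = tau)
    mild_variable  <- rnorm(n, mean = subject_effect, sd = sigma)

    # The variable-effect outcomes have SD sqrt(sigma^2 + tau^2) ~ 0.14,
    # so the groups overlap more and power drops
    sd(mild_constant)
    sd(mild_variable)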

    NOTE: One factor possibly contributing to all of this is that you are using reference effect/outcome sizes without reference to your actual units of measurement. In most medical research, effect sizes are "standardized" (albeit this practice has received much criticism). If you don't standardize appropriately, then plugging standardized effect sizes into power calculations will produce meaningless results.
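
    For instance, the standardized mean difference (Cohen's d) implied by your parameters is the raw difference divided by sigma; a quick check, assuming your simulation's values:

    # Standardized effects implied by mu = c(0.00, 0.2, 0.40), sigma = 0.1
    d_mild   <- 0.2 / 0.1   # d = 2, far beyond the conventional "large" d = 0.8
    d_severe <- 0.4 / 0.1   # d = 4

    # If 0.2 and 0.40 were meant as *standardized* effects, sigma should be 1,
    # and the implied AUCs are far more modest:
    pnorm(0.2 / sqrt(2))    # ~0.56 for Normal vs Mild
    pnorm(0.4 / sqrt(2))    # ~0.61 for Normal vs Severe

    In other words, if 0.00, 0.2, and 0.40 are meant as standardized effects, re-running your simulation with sigma <- 1 should yield the modest power figures you were expecting.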