Search code examples
rstatistics-bootstrap

Resampling not producing expected result of principal component analysis


I am trying following code to produce confidence intervals of principal component analysis using resampling with replacement (like bootstrap). I am using first 4 columns of iris dataset:

The prcomp function produces following output:

> mydf = iris[1:4]
> print(prcomp(mydf))
Standard deviations:
[1] 2.0562689 0.4926162 0.2796596 0.1543862

Rotation:
                     PC1         PC2         PC3        PC4
Sepal.Length  0.36138659 -0.65658877  0.58202985  0.3154872
Sepal.Width  -0.08452251 -0.73016143 -0.59791083 -0.3197231
Petal.Length  0.85667061  0.17337266 -0.07623608 -0.4798390
Petal.Width   0.35828920  0.07548102 -0.54583143  0.7536574

Using resampling with replacement:

> times = 1000
> ll = list()
> for(i in 1:times) {
+ tempdf =  mydf[sample(nrow(mydf), replace = TRUE), ]
+ ll[[length(ll)+1]] = prcomp(tempdf)$rotation
+ }
> 
> dd = data.frame(apply(simplify2array(ll), 1:2, mean))
> print(dd)
                      PC1          PC2          PC3          PC4
Sepal.Length  0.005574165 -0.039480258  0.044537991  0.007778055
Sepal.Width  -0.002587333 -0.040273812 -0.050793200 -0.005473271
Petal.Length  0.015681233  0.010952361 -0.005769051 -0.011351172
Petal.Width   0.006513656  0.008296928 -0.041805210  0.019109323

Determining lower confidence interval:

> ddlower = data.frame(apply(simplify2array(ll), 1:2, quantile, probs=0.025))
> print(ddlower)
                    PC1        PC2        PC3        PC4
Sepal.Length -0.3859257 -0.7274809 -0.6560139 -0.3807826
Sepal.Width  -0.1127749 -0.7907801 -0.6818251 -0.3941001
Petal.Length -0.8633386 -0.2058064 -0.1333520 -0.4919584
Petal.Width  -0.3702979 -0.1328146 -0.6203322 -0.8088710

Determining upper confidence interval:

> ddupper = data.frame(apply(simplify2array(ll), 1:2, quantile, probs=0.975))
> print(ddupper)
                   PC1       PC2       PC3       PC4
Sepal.Length 0.3860431 0.7250412 0.6632126 0.3831889
Sepal.Width  0.1111863 0.7993649 0.6758156 0.3987939
Petal.Length 0.8638549 0.2106540 0.1318556 0.4915670
Petal.Width  0.3721362 0.1510708 0.6246988 0.8083421

I find that the loading values are very different. Moreover, the confidence intervals are around 0 for all the variables and components. I checked with other (large) datasets also and the findings are very similar. From these confidence intervals none of the loadings are significantly different from 0. There is obviously some error in the code but I cannot seem to find it. Thanks for your help.


Solution

  • Given that the sign of an Eigenvector is not defined (you can flip the configuration and have the same result), it doesn't make sense to form a confidence interval on the the signed value of the loading.

    Instead compute the confidence interval on the absolute value of the loading, not the signed value.

    Think what happens to your interval when the Eigenvector for say Sepal.Length flips from ~ -0.3 to ~ +0.3? The loading is similar in both cases when considered from an absolute size point of view. When you look at the actual signed value however, it would be logical for the loading to be on average 0 as you are averaging a lot of ~-0.3s and ~0.3s.

    To visualise why your original attempt failed, run:

    set.seed(1)
    mydf <- iris[1:4]
    times <- 1000
    ll <- vector(mode = "list", length = times)
    for (i in seq_len(times)) {
      tempdf  <- mydf[sample(nrow(mydf), replace = TRUE), ]
      ll[[i]] <- prcomp(tempdf)$rotation
    }
    

    This is effectively your code, modified to suit my sensibilities. Now extract the loading for Sepal.Length on PC1 and draw a histogram of the values:

    hist(sapply(ll, `[`, 1, 1))
    

    which produces

    enter image description here

    Instead compute the confidence interval on the absolute value of the loading, not the signed value.

    For example

    set.seed(1)
    mydf <- iris[1:4]
    times <- 1000
    ll <- vector(mode = "list", length = times)
    for (i in seq_len(times)) {
      tempdf  <- mydf[sample(nrow(mydf), replace = TRUE), ]
      ll[[i]] <- abs(prcomp(tempdf)$rotation) ## NOTE: abs(...)
    }
    

    This gives:

    > data.frame(apply(simplify2array(ll), 1:2, quantile, probs = 0.025))
                        PC1         PC2        PC3       PC4
    Sepal.Length 0.33066830 0.578558222 0.45955051 0.2252653
    Sepal.Width  0.05211013 0.623424084 0.49591685 0.2351746
    Petal.Length 0.84823899 0.133137927 0.01226608 0.4607265
    Petal.Width  0.34284824 0.007403214 0.44932031 0.6780493
    
    > data.frame(apply(simplify2array(ll), 1:2, quantile, probs = 0.975))
                       PC1       PC2       PC3       PC4
    Sepal.Length 0.3891499 0.7443276 0.6690553 0.3898237
    Sepal.Width  0.1186205 0.7988607 0.7010495 0.4083784
    Petal.Length 0.8653324 0.2153410 0.1450756 0.4933340
    Petal.Width  0.3742441 0.1645692 0.6350899 0.8154254