
Same values for PCA Loadings results


I've recently performed a principal component analysis (PCA) for my master's thesis. I have 25 network datasets, each formatted as a graph, and I applied 5 measurements to each graph. The measurements were arranged in a table where the rows are datasets and the columns are the results, as shown below:

[Screenshot of the table: one row per dataset, one column per measure]

I then scaled the results to ensure that they are centered with mean zero (following An Introduction to Statistical Learning, G. James, 2013), using this function:

dat <- data.frame(lapply(measures, function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE)/100)))

This scales each measure by its maximum value divided by 100. I then applied PCA using the princomp function in R, princomp(dat, cor = T, scores = T), which returned these loading results:

Loadings:
                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Transitivity    0.585  0.412  0.246  0.136  0.640
Reciprocity     0.540 -0.145 -0.336 -0.750 -0.111
centralization -0.600  0.280        -0.582  0.469
density                0.327 -0.893  0.261  0.146
assortativity          0.790  0.159 -0.111 -0.581

                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings       1.0    1.0    1.0    1.0    1.0
Proportion Var    0.2    0.2    0.2    0.2    0.2
Cumulative Var    0.2    0.4    0.6    0.8    1.0

I would like to ask what would cause the SS loadings and Proportion Var rows to have exactly the same value for every component. I'm not sure if this is a discrepancy in my data or in the scaling method I'm using, or whether it's even something I should worry about. I see that someone had similar results in this question but did not discuss them, so perhaps it's normal? Any explanation of the impact this has will be much appreciated.

Biplot:

[Biplot of the first two principal components]

The scree plot also doesn't make much sense, since I expected an exponential drop-off; I assume this is a reflection of the loadings results. Scree plot:

[Scree plot of the five components]


Solution

  • I suppose the first question you would like answered is what the SS loadings are. They are the sums of the squares of the loadings; geometrically, each is the squared length of a loading vector (the length of a vector being the square root of the sum of its squared components). From a technical perspective, the eigenvectors (the loadings) form a basis of R^5, and each has been normalised so that the sum of the squares of its elements, i.e. its squared length, equals 1. You can think of it as a best practice of sorts, I suppose. That normalisation also explains the "Proportion Var" row: it is printed as the SS loadings divided by the number of variables (5 here), so it will always read 0.2; the actual proportion of variance explained by each component comes from the eigenvalues, as the sketch below shows.
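
    To see this in the princomp output directly, here is a minimal sketch (assuming dat is the scaled data frame from the question; p and L are just illustrative names). Each column of squared loadings sums to 1, and the printed "Proportion Var" row is simply that sum divided by the number of variables:

    p <- princomp(dat, cor = TRUE, scores = TRUE)
    #drop the "loadings" class to get a plain matrix
    L <- unclass(p$loadings)
    #SS loadings: 1 for every component
    colSums(L^2)
    #the printed "Proportion Var" row: always 1/5 = 0.2 here
    colSums(L^2) / ncol(dat)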

    In short, I wouldn't be too bothered by this.

    I would suggest reproducing the result from first principles, as below.

    #original data
    df <- data.frame('transitivity' = c(34, 8, 8, 37, 15, 29),
                     'reciprocity' = c(20, 34, 34, 25, 20, 7),
                     'centralization' = c(100, 99, 99, 100, 99, 99),
                     'density' = c(34, 7, 7, 2, 3, 0.7),
                     'assortativity' = c(-48, -53, -53, -33, 14, -45))
    #scale according to the OP's procedure.
    dat <- data.frame(lapply(df, function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE)/100)))
    #calculate correlation matrix.
    cormat <- cor(dat)
    #diagonalise
    pca <- eigen(cormat)
    #show that result is normalised. 
    apply(pca$vectors, 2, function(x) sum(x^2)) #each result is 1 regardless of whether we use margin 1 or 2. Neat exercise to prove why.
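    #(the columns are unit eigenvectors; the rows also square-sum to 1 because the eigenvector matrix is orthogonal)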
    #calculate % of variance explained by each component; the total variance is 5, the trace of the 5x5 correlation matrix.
    pc_var <- pca$values/5*100
    barplot(pc_var)
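
    As a cross-check, a minimal sketch comparing this against princomp itself (again assuming the dat from above): with cor = TRUE, the squared component standard deviations are exactly the eigenvalues of the correlation matrix.

    #same decomposition via princomp
    p <- princomp(dat, cor = TRUE)
    #squared sdevs are the eigenvalues; should match pca$values
    p$sdev^2
    #% of variance explained; should match pc_var
    100 * p$sdev^2 / sum(p$sdev^2)
    #same shape as barplot(pc_var)
    screeplot(p)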
    

    I am going to leave the interpretation of the results to you!