I have an environmental data set consisting of continuous, non-normally distributed observations. My goal is to construct a latent variable from the 5 measured variables. The theory behind this construct seems sound, but I'm stuck on getting the idea formalized.
The 5 variables are strongly correlated (bivariate correlations of .75–.95), and as I understand it, this is a problem for structural equation modeling? I've tried SEM with the 'lavaan' package in R, but I'm getting nowhere. So should I stick with SEM and try to iterate the model, or should I use some other approach?
Really more of a statistics question than an R question, but nevertheless...
Consider principal components analysis, which transforms a set of correlated variables into a new set of uncorrelated (orthogonal) variables (the principal components, or PCs). It is usually the case that a small number of PCs explain nearly all the variability in the original dataset. Using the built-in iris dataset in R:
data <- iris[, 1:4]                              # iris dataset, excluding the Species column
pca <- prcomp(data, retx = TRUE, scale. = TRUE)  # principal components analysis
PC <- pca$x                                      # the principal component scores
summary(pca)
Produces this:
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
So PC1, the first principal component, explains 73% of the variation in the dataset; the first two (PC1 and PC2) together explain 96% of the variation.
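If the goal is a single latent-variable-like score per observation, one common (if simple) approach is to use the scores on the first principal component. A minimal sketch continuing the iris example above (the variable name latent_score is just illustrative):

```r
data <- iris[, 1:4]                              # 4 correlated measured variables
pca <- prcomp(data, retx = TRUE, scale. = TRUE)  # scale. = TRUE: standardize first

latent_score <- pca$x[, 1]   # PC1 score: one number per observation
length(latent_score)         # 150, one score per row of the data

# Fraction of total variance captured by PC1 (~0.73 here)
summary(pca)$importance["Proportion of Variance", "PC1"]
```

This only makes sense as a latent-construct proxy when PC1 captures most of the shared variance, which your very high intercorrelations (.75–.95) suggest it would.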
Edit: Responding to @erska's comment/question below:
cor(data,PC)
Produces this:
PC1 PC2 PC3 PC4
Sepal.Length 0.8901688 -0.36082989 0.27565767 0.03760602
Sepal.Width -0.4601427 -0.88271627 -0.09361987 -0.01777631
Petal.Length 0.9915552 -0.02341519 -0.05444699 -0.11534978
Petal.Width 0.9649790 -0.06399985 -0.24298265 0.07535950
Which shows that PC1 is highly correlated with Sepal.Length, Petal.Length, and Petal.Width, and moderately negatively correlated with Sepal.Width. PC4 is not highly correlated with anything, which is not surprising, since it is composed mostly of random variation. This is a typical pattern in PCA.
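Note also that while the PCs correlate with the original variables, they are uncorrelated with each other by construction; this is easy to verify numerically:

```r
data <- iris[, 1:4]
pca <- prcomp(data, retx = TRUE, scale. = TRUE)
PC <- pca$x

# Off-diagonal correlations between PCs are zero (up to rounding),
# so cor(PC) is the 4x4 identity matrix
round(cor(PC), 10)
```

This orthogonality is exactly what removes the multicollinearity that is causing trouble in the SEM.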
I think there might be a misunderstanding of the way PCA works. If you have, say, n variables in your original dataset, PCA by definition will identify n principal components, ordered by the fraction of variability explained (so PC1 explains the most variability, etc.). You can tell the algorithm how many to report (e.g., just PC1, or PC1 and PC2, etc.), but the calculation always produces n PCs.
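You can confirm this with the iris example: four input variables always yield four components, regardless of how many you end up keeping.

```r
data <- iris[, 1:4]                              # n = 4 variables
pca <- prcomp(data, retx = TRUE, scale. = TRUE)

ncol(pca$x)      # 4: one PC per original variable
nrow(pca$x)      # 150: one row of scores per observation
```

Choosing how many PCs to retain (e.g., by cumulative proportion of variance, or a scree plot) is a separate decision made after the full decomposition.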