I have been able to calculate covariance for my large data set with:
cov(MyMatrix, use="pairwise.complete.obs",method="pearson")
This provided the covariance table I was looking for, as well as dealing with the NA issues that are throughout my data. For a deeper analysis, however, I want to create covariance matrices that deal separately with the 800+ groups I have in my data set (some have 40+ observations, others only 1). I tried (from http://www.mail-archive.com/r-help@r-project.org/msg86328.html):
lapply(list(cov), by, data = MyMatrix[8:13], INDICES = MyMatrix["Group"])
Which gave me the following error:
Error in tapply(seq_len(6L), list(MyMatrix["Group"] = NA_real_), function (x) : arguments must have same length
This made me think the problem was the NA values, so I tried adding the use="pairwise.complete.obs", method="pearson" arguments to the lapply call, but I can't get it to work. I'm not sure where they belong, so I tried putting them in several places:
lapply(list(cov), use="pairwise.complete.obs",method="pearson"),by,data=MyMatrix[8:13], INDICES = MyMatrix["Group"])
lapply(list(cov),by,data=PhenoMtrix[8:13], INDICES = PhenoMtrix["Group"], use="pairwise.complete.obs",method="pearson")
This is obviously sloppy and doesn't work, so I'm a little stuck. Thanks in advance for your help!
My data is formatted as such:
Group    HML    RML    FML    TML   FHD   BIB
    1 323.50 248.75 434.50 355.75 46.84    NA
    2     NA 238.50 441.50 353.00 45.83 277.0
    2 309.50 227.75 419.00 332.25 46.39 284.0
This would be much better if you provided an example of your data (or all of it), but since you didn't, here is an example using simulated data:
# create sample data
set.seed(1)
MyMatrix <- data.frame(group=rep(1:5, each=100),matrix(rnorm(2500),ncol=5))
# generate list of covariance matrices by group
cov.list <- lapply(unique(MyMatrix$group),
                   function(x) cov(MyMatrix[MyMatrix$group == x, -1],
                                   use = "na.or.complete"))
cov.list[1]
# [[1]]
#             X1          X2          X3          X4          X5
# X1  0.80676209 -0.09541458 -0.12704666 -0.04122976  0.08636307
# X2 -0.09541458  0.93350463 -0.05197573 -0.06457299 -0.02203141
# X3 -0.12704666 -0.05197573  1.06030090  0.07324986  0.01840894
# X4 -0.04122976 -0.06457299  0.07324986  1.12059428  0.02385031
# X5  0.08636307 -0.02203141  0.01840894  0.02385031  1.11101410
In this example we create a data frame called MyMatrix with six columns. The first is group and the other five are X1, X2, ... X5, which contain the data whose covariances we want. Hopefully, this is similar to the structure of your dataset.
The operative line of code is:
cov.list <- lapply(unique(MyMatrix$group),
                   function(x) cov(MyMatrix[MyMatrix$group == x, -1],
                                   use = "na.or.complete"))
This takes the vector of group IDs (from unique(MyMatrix$group)) and calls the function once for each of them. The function calculates the covariance matrix of all columns of MyMatrix except the first, using only the rows in the relevant group. lapply collects the results in a 5-element list (there are 5 groups in this example).
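A variant of the same idea, sketched on the same simulated data: split() divides the data columns by group first, so the list that lapply returns is named by group ID, which may be handy with 800+ groups. The name by.group is only illustrative, and this assumes the grouping column is the first column, as in the sample data.

```r
# same simulated data as above
set.seed(1)
MyMatrix <- data.frame(group = rep(1:5, each = 100),
                       matrix(rnorm(2500), ncol = 5))

# split the data columns (everything except group) into one data frame per group
by.group <- split(MyMatrix[, -1], MyMatrix$group)

# apply cov to each piece; extra arguments to lapply are passed on to cov
cov.list <- lapply(by.group, cov, use = "na.or.complete")

# the list elements are named after the group IDs
names(cov.list)
```

A single group's matrix can then be retrieved by name, for example cov.list[["3"]].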
Note: There are several options for dealing with NA; you should review the documentation at ?cov to see what they are. The one chosen here, use="na.or.complete", includes in the calculation only rows which have no NA values in any of the columns. If, for a given group, there are no such rows, cov(...) returns NA. The other choices are "everything", "all.obs", "complete.obs", and "pairwise.complete.obs" (the one used in the question).
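To make the difference between two of these options concrete, here is a small sketch on a made-up data frame (d and its columns x, y, z are invented for illustration, not taken from the question). With "na.or.complete" every entry is computed from the same complete rows, whereas "pairwise.complete.obs" uses, for each pair of columns, all rows where both columns are present.

```r
# toy data: row 4 is missing y, row 5 is missing x
d <- data.frame(x = c(1, 2, 3, 4, NA),
                y = c(2, 4, 6, NA, 10),
                z = c(1, 3, 2, 5, 4))

# casewise deletion: only rows 1-3 (no NA anywhere) are used for every entry
cov.complete <- cov(d, use = "na.or.complete")

# pairwise deletion: each entry uses all rows where that pair of columns is present
cov.pairwise <- cov(d, use = "pairwise.complete.obs")

# the y/z covariance differs: rows 1-3 versus rows 1, 2, 3 and 5
cov.complete["y", "z"]
cov.pairwise["y", "z"]
```

Which option is appropriate depends on the data: pairwise deletion keeps more observations, but the entries of the resulting matrix are then based on different subsets of rows.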