I was originally running a PCA to reduce a large number of correlated measures (>10 behaviours) down to fewer variables (I used the first and second principal components). But this is not appropriate (similar situation to this OP) because we have repeated measures from the same individuals over time (Budaev 2010, pg. 6: "Using multiple measures from the same individuals as independent while computing the correlation matrix is pseudoreplication and is incorrect."). Because of this, it has been recommended that I use a PARAFAC model instead of PCA (available through the PTAk package in R); see Leibovici (2010) for details.
My data is stored as a data.frame object, where each row is one observation of an individual; an individual can be sampled multiple times in a year and across its lifetime.
Sample of my data (data available here):
individual beh1 beh2 beh3 beh4 year
11979 0 0.0333 0 0 2014
12026 0.176 0.0882 0.441 0.0882 2014
12435 0.405 0.189 0 0.243 2014
12524 0 0 1 0 2014
12625 0 0 0 0 2014
12678 0 0 0 0 2014
To use the PTAk package, the data needs to be converted into an array. The code to do this is:
my_df <- array(as.vector(as.matrix(subset_data)), c(x, y, z))
where x is the number of rows, y is the number of columns, and z is the number of arrays.
My general question:
Which components of my data.frame should correspond to which measures in the array?
My initial guess would be that x should correspond to the number of individuals sampled (i.e., the number of rows in the original data.frame), but I am not sure what the y and z components should be.
Like this:
my_df <- array(as.vector(as.matrix(subset_data)), c(5393, 4, 9))
where x is 5393 individuals, y is the number of variables (e.g., 4 behaviours), and z is the number of years (9 years).
This generates 9 arrays with each individual’s record as the rows, and each variable as a column (identifier, 4 behaviours, and the year of sampling). In theory each array would correspond to a certain year of sampling, but that is currently not the case.
My question in detail:
If this is the correct formatting for my array, how do I ensure that only one year of sampling data is included in each array (i.e., only samples from 2008 are in array 1, only 2009 in array 2, etc.)?
Alternatively, if my formatting is wrong, what is the correct array format for my data and question?
For example, should I group the data into arrays according to the behaviour (beh1, beh2, etc.), so the code looks like:
my_df <- array(as.vector(as.matrix(subset_data)), c(5393, 3, 4))
where there would be three columns per array corresponding to the identifier, value for the behaviour, and year of observation? If this is the proper formatting, how would I ensure that the arrays are divided based on the behaviours rather than the identifier and/or year columns?
First of all, in your subset_data the variables individual and year need to be discarded (or used as rownames), as they are just identifiers. Otherwise as.vector(as.matrix(subset_data)) would mix them in with the data. So use as.vector(as.matrix(subset_data[, -c(1, 6)])) instead (individual and year are columns 1 and 6 in the sample shown above).
Then, look at the little example below:
A <- matrix(1:6, nrow = 2, ncol = 3)
as.vector(A)
gives
[1] 1 2 3 4 5 6
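So, with 2 individuals (rows) and 3 behaviours (columns), that works: the vector lists both individuals for behaviour 1, then for behaviour 2, and so on.
When A is filled, the index along dim(A)[1] (of size 2) runs faster than the index along dim(A)[2] (of size 3), and the same rule extends to arrays.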
Now imagine you have 4 years, where X[,,1] is your first year, A:
X <- array(0, c(2, 3, 4))
X[,,1] <- A
X[,,2] <- A*2
X[,,3] <- A*10
X[,,4] <- A/10
Note this could be a way of building your my_df, provided every year contains the same individuals in the same order:
my_df[,,1] <- as.matrix(subset_data[subset_data$year == 2014, -c(1, 6)])
etc.
My point was that as.vector(X) is then
1 2 3 4 5 6 2 4 6 8 10 12 ...
so the first year comes first, then the second year, etc.
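The ordering can be checked directly. This is a minimal sketch that only uses the toy objects A and X from this answer; nothing in it depends on the real data.

```r
# Toy data: 2 individuals x 3 behaviours, repeated over 4 "years"
A <- matrix(1:6, nrow = 2, ncol = 3)
X <- array(0, dim = c(2, 3, 4))   # individual x behaviour x year
X[, , 1] <- A
X[, , 2] <- A * 2
X[, , 3] <- A * 10
X[, , 4] <- A / 10

# Flattening walks the first index fastest, so each year comes out as one block
v <- as.vector(X)
print(v[1:12])   # the 6 values of year 1, then the 6 values of year 2
print(dim(X))
```

The first six values are year 1 (that is, as.vector(A)) and the next six are year 2, which is why the year index has to be the slowest-running dimension when the vector is laid out this way.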
So to get back to (or, in your case, to start from) an individual x variable matrix, you need to permute the array: AA <- matrix(aperm(X, c(1, 3, 2)), nrow = 8, ncol = 3)
Here 8 is 2 individuals times 4 years, and the 3 columns are the variables.
So if you start with that matrix AA, your array will be array(AA, dim = c(2, 4, 3)), i.e. individual x year x variable.
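As a sanity check, the round trip between the long matrix and the array can be verified on the toy example. This is only a sketch; X_iyv and X_back are names introduced here for illustration.

```r
# Rebuild the toy array: individual x behaviour x year
A <- matrix(1:6, nrow = 2, ncol = 3)
X <- array(0, dim = c(2, 3, 4))
X[, , 1] <- A
X[, , 2] <- A * 2
X[, , 3] <- A * 10
X[, , 4] <- A / 10

# Permute to individual x year x variable, then flatten to a long matrix
X_iyv <- aperm(X, c(1, 3, 2))
AA <- matrix(X_iyv, nrow = 8, ncol = 3)   # 8 rows = 2 individuals x 4 years

# Going back from the long matrix recovers the permuted array exactly
X_back <- array(AA, dim = c(2, 4, 3))
print(identical(X_back, X_iyv))
print(AA[1:2, ])   # rows for year 1: the same values as A
```

In AA the rows are ordered with individual running fastest within each year (individual 1 and 2 for year 1, then individual 1 and 2 for year 2, and so on), which is the row order a real data set must have before array() is called with these dimensions.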
So with:
AA <- as.matrix(subset_data[, -c(1, 6)])
(rows sorted by year, then by individual within each year) you would call array(AA, dim = c(nb_indi_repeated, 9, 4)) for 9 years and 4 variables. But 5393/9 is not a whole number, so it looks like you do not have complete repetition for all individuals. You will therefore need either to select the 'best sample' of repeated individuals (which defines both the years and the individuals retained), or to estimate the missing values, or to do something completely different. One option is to define the repetition not by year but by the sequence of repeated measures, the next one falling either in the same year or later.
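One way the 'best sample' option could look is sketched below on made-up data. The data frame toy and its values are hypothetical, the column names follow the sample in the question, and the sketch assumes at most one row per individual per year, which may not hold for the real data.

```r
# Hypothetical long-format data: individual 3 was not sampled in 2015
toy <- data.frame(
  individual = c(1, 2, 3, 2, 1),
  beh1       = c(0.1, 0.2, 0.3, 0.4, 0.5),
  beh2       = c(1, 2, 3, 4, 5),
  year       = c(2014, 2014, 2014, 2015, 2015)
)

years <- sort(unique(toy$year))

# Keep only individuals observed in every year
n_years_seen <- tapply(toy$year, toy$individual, function(y) length(unique(y)))
keep <- names(n_years_seen)[n_years_seen == length(years)]
complete <- toy[as.character(toy$individual) %in% keep, ]

# Sort so that individual runs fastest within each year
complete <- complete[order(complete$year, complete$individual), ]

# Drop the identifiers and fold into individual x year x variable
AA <- as.matrix(complete[, c("beh1", "beh2")])
X <- array(AA,
           dim = c(length(keep), length(years), ncol(AA)),
           dimnames = list(keep, as.character(years), colnames(AA)))
print(X["1", "2015", "beh1"])
```

Naming the dimensions makes it easy to spot-check that a given individual, year and behaviour ended up in the intended cell. With several samples per individual within a year, the rows would first have to be reduced to one per individual and year (or the repetition redefined, as suggested above).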