How to rearrange your data into an array for a PARAFAC model from the PTAk package in R


I was originally running a PCA to reduce a large number of correlated measures (>10 behaviours) down to fewer variables (in PCA I used the first and second principal components). But this is not appropriate (similar situation to this OP) because we have repeated measures from the same individuals over time (Budaev 2010, pg. 6: "Using multiple measures from the same individuals as independent while computing the correlation matrix is pseudoreplication and is incorrect."). Because of this, it is recommended I use a PARAFAC model instead of PCA to do this (available through the PTAk package in R) - see Leibovici (2010) for details.

My data is stored as a data.frame object, where each row is for one individual, that can be sampled multiple times in a year and across their lifetimes.

Sample of my data (data available here):

individual  beh1   beh2     beh3   beh4    year
11979       0      0.0333   0      0       2014
12026       0.176  0.0882   0.441  0.0882  2014
12435       0.405  0.189    0      0.243   2014
12524       0      0        1      0       2014
12625       0      0        0      0       2014
12678       0      0        0      0       2014

To use the PTAk package, the data needs to be converted into an array. The code to do this is:

my_df <- array(as.vector(as.matrix(subset_data)), c(x, y, z))

where x is the number of rows, y is the number of columns, and z is the number of arrays.

My general question:

Which components of my data.frame should correspond to which measures in the array?

My initial guess would be that x should correspond to the number of individuals sampled (i.e., the number of rows in the original data.frame), but I am not sure what the y and z components should be.

Like this:

my_df <- array(as.vector(as.matrix(subset_data)), c(5393, 4, 9))

where x is 5393 individuals, y is the number of variables (e.g., 4 behaviours), and z is the number of years (9 years).

This generates 9 arrays with each individual’s record as the rows, and each variable as a column (identifier, 4 behaviours, and the year of sampling). In theory each array would correspond to a certain year of sampling, but that is currently not the case.

My question in detail:

If this is the correct formatting for my array, how do I ensure that only one year of sampling data is included in each array (i.e., only samples from 2008 are in array 1, only 2009 in array 2, etc.)?

Alternatively, if my formatting is wrong, what is the correct array format for my data and question?

For example, should I group the data into arrays according to the behaviour (beh1, beh2, etc.), so the code looks like:

my_df<-array(as.vector(as.matrix(subset_data)), c(5393, 3, 4))

where there would be three columns per array corresponding to the identifier, value for the behaviour, and year of observation? If this is the proper formatting, how would I ensure that the arrays are divided based on the behaviours rather than the identifier and/or year columns?


Solution

  • First of all, the individual and year variables in your subset_data need to be discarded (or kept as row names), because they are just identifiers; otherwise as.vector() would mix them in with the data. So use as.vector(as.matrix(subset_data[, -c(1, 6)])), where 1 and 6 are the positions of individual and year in the sample shown.

    Then, look at this little example: A <- matrix(1:6, nrow = 2, ncol = 3)

    as.vector(A) is [1] 1 2 3 4 5 6

    So, if you imagine 2 individuals (rows) and 3 behaviours (columns), that works!

    In building A, the first index (dim(A)[1], of length 2) runs faster than the second (dim(A)[2], of length 3), and the same rule extends to arrays.

    So now imagine you have 4 years, where X[,,1] is your first-year A: X <- array(0, c(2, 3, 4)); X[,,1] <- A; X[,,2] <- A*2; X[,,3] <- A*10; X[,,4] <- A/10

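    Note that this could be a way of building your my_df, provided my_df has been created with the right dimensions and every year has the same individuals in the same order:

    my_df[,,1] <- as.matrix(subset_data[subset_data$year == 2014, -c(1, 6)]), etc.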

    My point was that as.vector(X) is then

    1 2 3 4 5 6 2 4 6 8 10 12 ...

    so the first year, then the second year, etc.

    So to come back to (or, in fact, to start from) a matrix of individuals x variables, you'll need to permute the data: AA <- matrix(aperm(X, c(1, 3, 2)), nrow = 8, ncol = 3). Here 8 is 2 individuals times 4 years, and 3 is the number of variables.

    So if you start with that matrix AA, your array will be array(AA, dim = c(2, 4, 3)), i.e. individual x year x variable.

    So with AA <- as.matrix(subset_data[, -c(1, 6)]), with the rows sorted by year and then by individual within each year,

    you'll need to say array(AA, dim = c(nb_indi_repeated, 9, 4)) for 9 years and 4 variables. But 5393/9 is not a whole number, so it looks like you do not have exact, full repetition for all individuals. So you'll need either to select the 'best sample' of repeated individuals (to define the years and the selected individuals), or to estimate the missing values, or to do something completely different! One option would be to define the repetition not by year but by the sequence of repeated measures, the next one being either in the same year or later.
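To make the steps above concrete, here is a minimal sketch in base R. Part 1 replays the toy A / X / AA example to check the fill order. Part 2 builds an individual x year x behaviour array from a long data.frame when individuals are not observed in every year. The toy subset_data, the choice to average repeated samples within a year, and the names yearly, my_arr and my_arr_complete are assumptions made for illustration; they are not from the question's data, and averaging is only one possible way of getting one row per individual and year.

```r
# --- Part 1: toy check of the fill order (A and X as in the answer) ---
A <- matrix(1:6, nrow = 2, ncol = 3)        # 2 individuals x 3 behaviours
X <- array(0, dim = c(2, 3, 4))             # individual x behaviour x year
X[, , 1] <- A; X[, , 2] <- A * 2; X[, , 3] <- A * 10; X[, , 4] <- A / 10

# Stack the years: 8 rows (individual varies fastest, then year) x 3 behaviours
AA <- matrix(aperm(X, c(1, 3, 2)), nrow = 8, ncol = 3)
# Fold AA back into individual x year x behaviour
Y <- array(AA, dim = c(2, 4, 3))

# --- Part 2: long data.frame with incomplete repetition (hypothetical data) ---
subset_data <- data.frame(
  individual = c(1, 1, 2, 2, 2, 3),
  beh1       = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
  beh2       = c(1, 2, 3, 4, 5, 6),
  year       = c(2013, 2014, 2013, 2014, 2014, 2014)
)
behs <- c("beh1", "beh2")

# Assumption: collapse repeated samples within a year to their mean,
# so that there is exactly one row per individual and year
yearly <- aggregate(subset_data[behs],
                    by = list(individual = subset_data$individual,
                              year = subset_data$year),
                    FUN = mean)

ids   <- sort(unique(yearly$individual))
years <- sort(unique(yearly$year))

# Pre-fill with NA so unobserved individual-year cells stay visible
my_arr <- array(NA_real_,
                dim = c(length(ids), length(years), length(behs)),
                dimnames = list(individual = ids, year = years, behaviour = behs))

# Matrix indexing: one (individual, year, behaviour) triple per observed value
i <- match(yearly$individual, ids)
k <- match(yearly$year, years)
for (j in seq_along(behs)) {
  my_arr[cbind(i, k, j)] <- yearly[[behs[j]]]
}

# One of the options mentioned above: keep only individuals seen in every year
complete <- apply(my_arr, 1, function(m) !anyNA(m))
my_arr_complete <- my_arr[complete, , , drop = FALSE]
```

If you prefer one matrix per year, as in the question, aperm(my_arr_complete, c(1, 3, 2)) gives individual x behaviour x year. Filling by index rather than by reshaping a sorted matrix means the result does not depend on the row order of subset_data, and the NA cells show exactly where the repetition is incomplete.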