
How to find the maximum lag in a list of cross correlations in R (ccf)?


This is an example of my data:

id <- c(1,1,1,1,2,2,3,3,3,3,4,4,4)
Affect <- c(0.8, 0.5, NA, 0.8, 0.2, 0.1, 0.7, 1.1, 0.9, 0.5, 0.3, NA, 0.9)
Paranoia <-  c(0.9, 0.6, 0.4, 0.2, 0.1, NA, 0.3, 0.1, 0.9, 1.5, 0.4, 0.1, 0.6)
both <- data.frame(id, Affect, Paranoia)

Now I calculate a cross-correlation for each ID separately, which gives me a list:

library(tseries)
library(dplyr)
library(tidyr)
out <- both %>%
  group_by(id) %>%
  filter(!(all(is.na(Affect))|all(is.na(Paranoia)))) %>% 
  mutate_at(vars(Affect, Paranoia), replace_na, 0) %>% 
  dplyr::summarise(ccfout = list(ccf(Affect, Paranoia, ylim=c(-10, 10), lag.max=5)))

What I want to do now is find, for each ID, the lag at which the correlation is at its maximum and the correlation value at that lag. I tried this, but it didn't work, probably because I have a list for each ID:

Find_Max_CCF <- function(Affect,Paranoia)
{
  d<- both %>%
    group_by(id) %>%
    filter(!(all(is.na(Affect))|all(is.na(Paranoia)))) %>% 
    mutate_at(vars(Affect, Paranoia), replace_na, 0) %>% 
    dplyr::summarise(ccfout = list(ccf(Affect, Paranoia, ylim=c(-10, 10))))
  cor = d$acf[,,1]
  lag = d$lag[,,1]
  res = data.frame(cor,lag)
  res_max = res[which.max(res$cor),]
  return(res_max)
}

Find_Max_CCF(both)

The error message is:

1: Unknown or uninitialised column: 'acf'. 
2: Unknown or uninitialised column: 'lag'. 
3: Unknown or uninitialised column: 'acf'. 
4: Unknown or uninitialised column: 'lag'

Do you have any ideas? Thanks a lot in advance.


Solution

  • The problem is that the column ccfout you create is a list of acf objects (one per id), so d$acf and d$lag look for columns that do not exist in the summarised tibble. To slice the way you tried to, you want data frames instead.
    I wrote a function ccf_as_df that returns a list containing a data.frame with columns lag and ccf, extracted from the acf object that ccf() returns.

    ccf_as_df <- function(x, y) {
      # Compute the cross-correlation without plotting and return it as a
      # one-element list holding a data frame with columns `lag` and `ccf`
      # (ylim only affects the plot, so it is omitted here)
      ccf_obj <- ccf(x, y, lag.max = 5, plot = FALSE)
      ccf_df <- data.frame(lag = as.vector(ccf_obj$lag), ccf = as.vector(ccf_obj$acf))
      return(list(ccf_df))
    }
    
    out <- both %>%
      group_by(id) %>%
      filter(!(all(is.na(Affect))|all(is.na(Paranoia)))) %>% 
      mutate_at(vars(Affect, Paranoia), replace_na, 0) %>% 
      summarise(ccfout = ccf_as_df(Affect, Paranoia))
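To see what the helper is unpacking, it can help to look at a single ccf() result on its own. The snippet below is a minimal sketch using made-up toy vectors (x and y are illustrative, not the question's data); it shows that the result is an acf object whose acf and lag components are 3-d arrays, which is why as.vector() is used to flatten them.

```r
# Toy series (hypothetical values, chosen only for illustration)
x <- c(0.8, 0.5, 0.0, 0.8)
y <- c(0.9, 0.6, 0.4, 0.2)

# plot = FALSE returns the object without drawing anything
obj <- ccf(x, y, lag.max = 5, plot = FALSE)

class(obj)    # an object of class "acf", not a data frame
dim(obj$acf)  # a 3-d array: one row per lag, then 1 x 1

# Flatten both arrays into plain vectors and pair them up
ccf_df <- data.frame(lag = as.vector(obj$lag), ccf = as.vector(obj$acf))
print(ccf_df)
```

Note that lag.max is capped at the series length minus one, so short groups return fewer lags than requested.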
    

    Now, the ccfout column contains lists of dataframes, which you can unnest to get a dataframe with three columns: id, lag and ccf.
    This can then be grouped by id to get the maximum ccf and the lag at which this occurs:

    out %>% 
      unnest(ccfout) %>% 
      group_by(id) %>% 
      summarise(max_ccf = max(ccf),
                max_ccf_lag = lag[which.max(ccf)])
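For comparison, the same per-id result can probably be obtained without the intermediate data frames. The following is a rough base R sketch, not the answer's method: max_ccf_one is a hypothetical helper name, and it assumes (as in the question) that NA values are replaced by 0 and that no group ends up with a constant series.

```r
# Example data from the question
id <- c(1,1,1,1,2,2,3,3,3,3,4,4,4)
Affect <- c(0.8, 0.5, NA, 0.8, 0.2, 0.1, 0.7, 1.1, 0.9, 0.5, 0.3, NA, 0.9)
Paranoia <- c(0.9, 0.6, 0.4, 0.2, 0.1, NA, 0.3, 0.1, 0.9, 1.5, 0.4, 0.1, 0.6)
both <- data.frame(id, Affect, Paranoia)

# Hypothetical helper: largest cross-correlation and its lag for one id
max_ccf_one <- function(d) {
  # Skip groups where either series is entirely missing
  if (all(is.na(d$Affect)) || all(is.na(d$Paranoia))) return(NULL)
  # Replace NA by 0, mirroring the replace_na() step in the question
  x <- ifelse(is.na(d$Affect), 0, d$Affect)
  y <- ifelse(is.na(d$Paranoia), 0, d$Paranoia)
  obj <- ccf(x, y, lag.max = 5, plot = FALSE)
  cors <- as.vector(obj$acf)
  lags <- as.vector(obj$lag)
  i <- which.max(cors)  # index of the largest (signed) correlation
  data.frame(id = d$id[1], max_ccf = cors[i], max_ccf_lag = lags[i])
}

# Apply the helper to each id and stack the one-row results
res <- do.call(rbind, lapply(split(both, both$id), max_ccf_one))
print(res)
```

Both versions pick the largest signed correlation. If the strongest relationship regardless of sign is wanted, which.max(abs(cors)) (or which.max(abs(ccf)) in the dplyr version) would be the corresponding change.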