Search code examples
rdplyrtime-seriespanel-data

Calculating the percentage of matching observation from one period to another in a panel data


I have a time-series panel dataset that is structured in the following way: There are multiple funds that each own multiple stocks and we have a value column for the stock. As you can see the panel is not balanced. My actual dataset is very large with each fund having at least 500 stocks and different quarters being represented with some having missing quarter values.

df <- data.frame(
  fund_id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2),
  stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,3,3,3,3),
  year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2011-12","2012-03","2012-06","2012-09",
           "2012-12","2013-03","2013-06","2014-09","2015-03","2013-03","2013-06","2013-09","2013-12"),
  value = c(1,2,1,3,4,2,1,2,3,4,2,1,3,1,1,3,2,3,1)
)


> df
   fund_id stock_id  year_q value
1        1        1 2011-03     1
2        1        1 2011-06     2
3        1        1 2011-09     1
4        1        1 2011-12     3
5        1        1 2012-03     4
6        1        1 2012-06     2
7        1        2 2011-12     1
8        1        2 2012-03     2
9        1        2 2012-06     3
10       1        2 2012-09     4
11       1        2 2012-12     2
12       1        2 2013-03     1
13       1        2 2013-06     3
14       2        1 2014-09     1
15       2        1 2015-03     1
16       2        3 2013-03     3
17       2        3 2013-06     2
18       2        3 2013-09     3
19       2        3 2013-12     1

I would like to calculate for each fund, the percentage of stocks held in the current quarter that were ever held in the previous one to 3 quarters. So basically for every fund and every date, I would like to have 3 columns with past 1 Q, past 2Q and past 3Q which show what percentage of stocks held on that date were also present in each of those past quarters.
Here is what the result should look like:

result <- data.frame(
  fund_id = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
  year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2012-09","2012-12","2013-03","2013-06",
               "2013-03","2013-06","2013-09","2013-12","2014-03","2014-06","2014-09","2014-12","2015-03"),
  past_1Q = c("NA",1,1,0.5,1,1,0.5,1,1,1,"NA",1,1,1,"NA","NA","NA","NA","NA"),
  past_2Q = c("NA","NA",1,0.5,0.5,1,0.5,0.5,1,1,"NA","NA",1,1,"NA","NA","NA","NA","NA"),
  past_3Q = c("NA","NA","NA",0.5,0.5,0.5,0.5,0.5,0.5,1,"NA","NA","NA",1,"NA","NA","NA","NA","NA")
)

> result
   fund_id  year_q past_1Q past_2Q past_3Q
1        1 2011-03      NA      NA      NA
2        1 2011-06       1      NA      NA
3        1 2011-09       1       1      NA
4        1 2011-12     0.5     0.5     0.5
5        1 2012-03       1     0.5     0.5
6        1 2012-06       1       1     0.5
7        1 2012-09     0.5     0.5     0.5
8        1 2012-12       1     0.5     0.5
9        1 2013-03       1       1     0.5
10       1 2013-06       1       1       1
11       2 2013-03      NA      NA      NA
12       2 2013-06       1      NA      NA
13       2 2013-09       1       1      NA
14       2 2013-12       1       1       1
15       2 2014-03      NA      NA      NA
16       2 2014-06      NA      NA      NA
17       2 2014-09      NA      NA      NA
18       2 2014-12      NA      NA      NA
19       2 2015-03      NA      NA      NA



I tried to do this using rollapply but can't get the correct results. I understand that this might not be the best sample data but in my real data each fund usually have more than 500 stocks and I expect the percentage of matching stocks from one period to the past periods to be something around 0.95 on average.

This is what I have to get the first two result columns (credits to @r2evans):

result <- df %>%
  group_by(fund_id) %>%
  mutate(miny = min(year_q), maxy = max(year_q)) %>%
  distinct(fund_id, miny, maxy) %>%
  group_by(fund_id) %>%
  mutate(across(c(miny, maxy), ~ as.Date(paste0(., "-01")))) %>%
  transmute(year_q = purrr::map2(miny, maxy, ~ format(seq(.x, .y, by = "3 months"), format = "%Y-%m")))  %>%
  tidyr::unnest(year_q) %>%
  full_join(df, by = c("fund_id", "year_q")) %>%
  distinct(fund_id, year_q) %>%
  arrange(fund_id, year_q)

Solution

  • library(tidyverse)
    
    df %>%
      mutate(year_q = as.Date(paste0(year_q, '-01'))) %>%
      group_by(fund_id, year_q) %>%
      summarise(stock_id = list(unique(stock_id))) %>%
      complete(year_q = seq(min(year_q), max(year_q), by = "3 months")) %>%
      reduce(.init = ., 1:3, ~ mutate(.x, "past_{.y}Q" := map(1:n(), \(N) unlist(stock_id[pmax(N-.y, 0)])))) %>%
      mutate(across(contains("past"), \(past) map2_dbl(stock_id, past, ~ mean(.x %in% .y)) %>% replace_na(0))) %>%
      ungroup()
    
    # A tibble: 19 × 6
       fund_id year_q     stock_id  past_1Q past_2Q past_3Q
         <dbl> <date>     <list>      <dbl>   <dbl>   <dbl>
     1       1 2011-03-01 <dbl [1]>     0       0       0  
     2       1 2011-06-01 <dbl [1]>     1       0       0  
     3       1 2011-09-01 <dbl [1]>     1       1       0  
     4       1 2011-12-01 <dbl [2]>     0.5     0.5     0.5
     5       1 2012-03-01 <dbl [2]>     1       0.5     0.5
     6       1 2012-06-01 <dbl [2]>     1       1       0.5
     7       1 2012-09-01 <dbl [1]>     1       1       1  
     8       1 2012-12-01 <dbl [1]>     1       1       1  
     9       1 2013-03-01 <dbl [1]>     1       1       1  
    10       1 2013-06-01 <dbl [1]>     1       1       1  
    11       2 2013-03-01 <dbl [1]>     0       0       0  
    12       2 2013-06-01 <dbl [1]>     1       0       0  
    13       2 2013-09-01 <dbl [1]>     1       1       0  
    14       2 2013-12-01 <dbl [1]>     1       1       1  
    15       2 2014-03-01 <NULL>        0       0       0  
    16       2 2014-06-01 <NULL>        0       0       0  
    17       2 2014-09-01 <dbl [1]>     0       0       0  
    18       2 2014-12-01 <NULL>        0       0       0  
    19       2 2015-03-01 <dbl [1]>     0       1       0