Search code examples
raveragemoving-averagerolling-computation

How to calculate moving average for two years in r


I have a big data frame (900k rows) about mergers and acquisitions (M&As).

The df has four columns: date (when the M&A was completed), target_nation (a company of which country was merged/acquired), acquiror_nation (corporation of which country was the acquiror), and big_corp (whether the acquiror was a big corporation or not, where TRUE means that corporation is big).

Here is a sample of my df:

> df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L, 
2002L, 2002L), target_nation = c("Uganda", "Uganda", "Uganda", 
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France", 
"Germany", "France", "France", "Germany", "France", "France", 
"Germany"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, 
TRUE, TRUE, TRUE)), row.names = c(NA, -8L))

> df 

   date target_nation acquiror_nation big_corp_TF
1: 2000        Uganda          France        TRUE
2: 2000        Uganda         Germany       FALSE
3: 2001        Uganda          France        TRUE
4: 2001        Uganda          France       FALSE
5: 2001        Uganda         Germany       FALSE
6: 2002        Uganda          France        TRUE
7: 2002        Uganda          France        TRUE
8: 2002        Uganda         Germany        TRUE

From these data, I want to create a new variable that denotes the share of M&As done by big corporations of specific acquiror nations, counting the average for 2 years. (For my actual exercise, I will count the averages for 5 years, but let's keep things simpler here). So there would be a new variable for France's big corporations, and a new variable for Germany's big corporations.

What I have managed to do so far is to 1) count the total number of M&As in a specific target_nation for a certain year; and 2) count the total number of M&As conducted by a big corporation of a certain acquiror_nation in a specific target_nation in a certain year. I have joined these two dfs to facilitate the calculation of the averages I want. Here's the code I've used and the resulting new df:

##counting total rows for target nations
df2 <- df %>%
 group_by(date, target_nation) %>%
 count(target_nation)

##counting total rows conducted by small or big corps for certain acquiror nations

df3 <- df %>%
  group_by(date, target_nation, acquiror_nation) %>%
  count(big_corp_TF)

##selecting rows that were conducted by big corps

df33 <- df3 %>%
  filter(big_corp_TF == TRUE)

##merging df2 and df33

df4 <- df2 %>%
  left_join(df33, by = c("date" = "date", "target_nation" = "target_nation"))

df4 <- as.data.frame(df4)

> df4

  date target_nation n.x acquiror_nation big_corp_TF n.y
1 2000        Uganda   2          France        TRUE   1
2 2001        Uganda   3          France        TRUE   1
3 2002        Uganda   3          France        TRUE   2
4 2002        Uganda   3         Germany        TRUE   1

n.x here is the total number of M&As (rows) for a specific target_nation in a certain year; n.y is the total number of M&As (rows) conducted by big corporations of specific acquiror_nations in a certain target_nation.

With this new data frame df4 I could now easily calculate the share of M&As conducted by big corporations of a specific acquiror_nation in a certain target_nation in a certain single year. For example, let's count this share for France:

df5 <- df4 %>% 
  filter(acquiror_nation == "France") %>%
  mutate(France_bigcorp_share_1year = n.y / n.x)

  date target_nation n.x acquiror_nation big_corp_TF n.y France_bigcorp_share_1year
1 2000        Uganda   2          France        TRUE   1                  0.5000000
2 2001        Uganda   3          France        TRUE   1                  0.3333333
3 2002        Uganda   3          France        TRUE   2                  0.6666667

However, I cannot figure out how to calculate the share of M&As done by big corporations of specific acquiror nations, counting the average for 2 years.

This is what a desired variable would look like:

  date target_nation n.x acquiror_nation big_corp_TF n.y France_bigcorp_share_2years
1 2000        Uganda   2          France        TRUE   1                  0.5000000
2 2001        Uganda   3          France        TRUE   1                  0.4000000
3 2002        Uganda   3          France        TRUE   2                  0.5000000

Note that the share for 2000 would remain the same, because there is no prior year to make it a 2 year average; for 2001 it would become 0.4 (because (1+1)/(2+3) = 0.4); for 2002 it would become 0.5 (because (1+2)/(3+3) = 0.5).

Do you have an idea on how to write a code that would calculate the average share for two years? I suppose I would need to use a for loop here, but I can't figure out how. Any suggestions would be appreciated.

--

EDIT: AnilGoyal's code works perfectly with the sample data, but my actual data is apparently more messier and I thus wonder if there is a solution to the problem I encounter.

My actual data set sometimes skips a year, or sometimes doesn't include the acquiror_nations that were included in previous rows. Please see a more accurate sample of my actual data:

> df_new <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L, 
2002L, 2002L, 2003L, 2003L, 2004L, 2004L, 2004L, 2006L, 2006L
), target_nation = c("Uganda", "Uganda", "Uganda", "Uganda", 
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", 
"Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France", 
"Germany", "France", "France", "Germany", "France", "France", 
"Germany", "Germany", "Germany", "France", "France", "Germany", 
"France", "France"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, 
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)), row.names = c(NA, 
-15L))

> df_new 

    date target_nation acquiror_nation big_corp_TF
 1: 2000        Uganda          France     TRUE
 2: 2000        Uganda         Germany    FALSE
 3: 2001        Uganda          France     TRUE
 4: 2001        Uganda          France    FALSE
 5: 2001        Uganda         Germany    FALSE
 6: 2002        Uganda          France     TRUE
 7: 2002        Uganda          France     TRUE
 8: 2002        Uganda         Germany     TRUE
 9: 2003        Uganda         Germany     TRUE
10: 2003        Uganda         Germany    FALSE
11: 2004        Uganda          France     TRUE
12: 2004        Uganda          France    FALSE
13: 2004        Uganda         Germany     TRUE
14: 2006        Uganda          France     TRUE
15: 2006        Uganda          France     TRUE

NB: There are no rows for France in 2003; and there is no year 2005.

If I run Anil's first code, the result is the following tibble:

   date target_nation acquiror_nation    n1    n2 share
  <int> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1   0.5
2  2001 Uganda        France              3     1   0.4
3  2002 Uganda        France              3     2   0.5
4  2004 Uganda        France              3     1   0.5
5  2006 Uganda        France              2     2   0.6

NB: there is no result for France for 2003 and 2005; I would like there to be results for 2003 and 2005 (because we are calculating 2-year averages and thus we should be able to have results for 2003 and 2005). Also, the share for 2006 is incorrect in reality, because it should be 1 (it should take the values of 2005 (which are 0s) rather than the values of 2004 for the calculation of average).

I would like to be able to receive the following tibble:

       date target_nation acquiror_nation    n1    n2 share
      <int> <chr>         <chr>           <dbl> <int> <dbl>
    1  2000 Uganda        France              2     1   0.5
    2  2001 Uganda        France              3     1   0.4
    3  2002 Uganda        France              3     2   0.5
    4  2003 Uganda        France              2     0   0.4
    5  2004 Uganda        France              3     1   0.2
    6  2005 Uganda        France              0     0   0.33
    7  2006 Uganda        France              2     2   1.0

NB: notice that the result for 2006 is also different (because we now take 2005 instead of 2004 for a two-year average).

Do you think it's possible to find a way to output the desired tibble? I understand that this is a problem with the original data: it simply lacks certain data points. However, including them to the original data set seems to be highly inconvenient; it is probably better to include them mid-way, e.g. after counting the n1 and n2. But what is the most convenient way to do this?

EDIT2: Anil's new code works well with the data sample above, but it runs into an undesired issue when dealing with a more complex data sample (that includes more than one target_nation). Here is a shorter but more complex data sample:

> df_new_complex <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2003L, 
2003L, 1999L, 2001L, 2002L, 2002L), target_nation = c("Uganda", 
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Mozambique", 
"Mozambique", "Mozambique", "Mozambique"), acquiror_nation = c("France", 
"Germany", "France", "France", "Germany", "Germany", "Germany", 
"Germany", "France", "France", "Germany"), big_corp_TF = c(TRUE, 
FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE
)), row.names = c(NA, -11L))

> df_new_complex 

date target_nation acquiror_nation big_corp_TF
 1: 2000        Uganda          France        TRUE
 2: 2000        Uganda         Germany       FALSE
 3: 2001        Uganda          France        TRUE
 4: 2001        Uganda          France       FALSE
 5: 2001        Uganda         Germany       FALSE
 6: 2003        Uganda         Germany        TRUE
 7: 2003        Uganda         Germany       FALSE
 8: 1999    Mozambique         Germany       FALSE
 9: 2001    Mozambique          France        TRUE
10: 2002    Mozambique          France       FALSE
11: 2002    Mozambique         Germany        TRUE

As you see, this data sample includes two target_nations. Anil's code, where param <- c("France", "Germany"), produces the following tibble:

    date target_nation acquiror_nation    n1    n2 share
   <dbl> <chr>         <chr>           <dbl> <int> <dbl>
 1  1999 Mozambique    France              1     0 0    
 2  1999 Mozambique    Germany             1     0 0    
 3  1999 Uganda        France              0     0 0    
 4  1999 Uganda        Germany             0     0 0    
 5  2000 Mozambique    France              0     0 0    
 6  2000 Mozambique    Germany             0     0 0    
 7  2000 Uganda        France              2     1 0.25 
 8  2000 Uganda        Germany             2     0 0.167
 9  2001 Mozambique    France              1     1 0.4  
10  2001 Mozambique    Germany             1     0 0.333
11  2001 Uganda        France              3     1 0.333
12  2001 Uganda        Germany             3     0 0.25 
13  2002 Mozambique    France              2     0 0.2  
14  2002 Mozambique    Germany             2     1 0.25 
15  2002 Uganda        France              0     0 0.25 
16  2002 Uganda        Germany             0     0 0.25 
17  2003 Mozambique    France              0     0 0.25 
18  2003 Mozambique    Germany             0     0 0.25 
19  2003 Uganda        France              2     0 0.167
20  2003 Uganda        Germany             2     1 0.25 

What's undesired here is the fact that the code creates a year 1999 for Uganda, and a year 2003 for Mozambique (the latter is less of an issue). In year 1999, Uganda had no investments as shown in the data sample, so it doesn't make sense to have numeric values for that (it could have NAs, or not be there at all). Mozambique also didn't have investments in year 2003, so I don't want to calculate the shares for that year for Mozambique.

I have found a workaround for this, whereby I filter for a specific target nation early in the code, just like so:

correct1 <- df_new_complex %>% 
  filter(target_nation == "Mozambique") %>%
  mutate(d = 1) %>% ...

#I do the same for another target_nation

correct2 <- df_new_complex %>% 
  filter(target_nation == "Uganda") %>%
  mutate(d = 1) %>% ...

#I then use rbind

correct <- rbind(correct1, correct2)

#which produces the desired tibble (without a year 2003 for Mozambique and 1999 for Uganda).

> correct 

date target_nation acquiror_nation    n1    n2 share
   <dbl> <chr>         <chr>           <dbl> <int> <dbl>
 1  1999 Mozambique    France              1     0 0    
 2  1999 Mozambique    Germany             1     0 0    
 3  2000 Mozambique    France              0     0 0    
 4  2000 Mozambique    Germany             0     0 0    
 5  2001 Mozambique    France              1     1 1    
 6  2001 Mozambique    Germany             1     0 0 
 7  2002 Mozambique    France              2     0 0.33 
 8  2002 Mozambique    Germany             2     1 0.333
 9  2000 Uganda        France              2     1 0.5  
10  2000 Uganda        Germany             2     0 0.25 
11  2001 Uganda        France              3     1 0.286
12  2001 Uganda        Germany             3     0 0.2  
13  2002 Uganda        France              0     0 0.167
14  2002 Uganda        Germany             0     0 0.167
15  2003 Uganda        France              2     0 0    
16  2003 Uganda        Germany             2     1 0.25 

What would be a quicker way to do this? I have a list of desired target_nations. Perhaps it would be possible to create a loop, whereby I do calculations for one target_nation, then another one; then rbind them; then another one; then rbind, etc. Or is there a better way?


Solution

  • With package runner you can do something like this

    df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L, 
                                  2002L, 2002L), target_nation = c("Uganda", "Uganda", "Uganda", 
                                                                   "Uganda", "Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France", 
                                                                                                                                          "Germany", "France", "France", "Germany", "France", "France", 
                                                                                                                                          "Germany"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, 
                                                                                                                                                                      TRUE, TRUE, TRUE)), row.names = c(NA, -8L))
    
    library(runner)
    library(tidyverse)
    df <- df %>% as.data.frame()
    param <- 'France'
    df %>% 
      group_by(date, target_nation) %>%
      mutate(n1 = n()) %>%
      group_by(date, target_nation, acquiror_nation) %>%
      summarise(n1 = mean(n1),
                n2 = sum(big_corp_TF), .groups = 'drop') %>%
      filter(acquiror_nation == param) %>%
      mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
    #> # A tibble: 3 x 6
    #>    date target_nation acquiror_nation    n1    n2 share
    #>   <int> <chr>         <chr>           <dbl> <int> <dbl>
    #> 1  2000 Uganda        France              2     1   0.5
    #> 2  2001 Uganda        France              3     1   0.4
    #> 3  2002 Uganda        France              3     2   0.5
    

    Even you can do for all nations simultaneously

    
    df %>% 
      group_by(date, target_nation) %>%
      mutate(n1 = n()) %>%
      group_by(date, target_nation, acquiror_nation) %>%
      summarise(n1 = mean(n1),
                n2 = sum(big_corp_TF), .groups = 'drop') %>%
      group_by(acquiror_nation) %>%
      mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
    #> # A tibble: 6 x 6
    #> # Groups:   acquiror_nation [2]
    #>    date target_nation acquiror_nation    n1    n2 share
    #>   <int> <chr>         <chr>           <dbl> <int> <dbl>
    #> 1  2000 Uganda        France              2     1 0.5  
    #> 2  2000 Uganda        Germany             2     0 0    
    #> 3  2001 Uganda        France              3     1 0.4  
    #> 4  2001 Uganda        Germany             3     0 0    
    #> 5  2002 Uganda        France              3     2 0.5  
    #> 6  2002 Uganda        Germany             3     1 0.167
    

    In view of revised scenario, you need to do 2 things -

    • include argument idx = date in both sum_run functions. This would correct the output as desired but won't include share for missing rows/years.
    • To include missing years too you'll need tidyr::complete as shown below-
    param <- 'France'
    df_new %>% 
      mutate(d = 1) %>%
      complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
               fill = list(d =0, big_corp_TF = FALSE)) %>%
      group_by(date, target_nation) %>%
      mutate(n1 = sum(d)) %>%
      group_by(date, target_nation, acquiror_nation) %>%
      summarise(n1 = mean(n1),
                n2 = sum(big_corp_TF), .groups = 'drop') %>%
      filter(acquiror_nation == param) %>%
      mutate(share = sum_run(n2, k=2, idx = date)/sum_run(n1, k=2, idx = date))
    
    # A tibble: 7 x 6
       date target_nation acquiror_nation    n1    n2 share
      <dbl> <chr>         <chr>           <dbl> <int> <dbl>
    1  2000 Uganda        France              2     1 0.5  
    2  2001 Uganda        France              3     1 0.4  
    3  2002 Uganda        France              3     2 0.5  
    4  2003 Uganda        France              2     0 0.4  
    5  2004 Uganda        France              3     1 0.2  
    6  2005 Uganda        France              0     0 0.333
    7  2006 Uganda        France              2     2 1
    

    Similar to above, you can do it for all nations at once (replcae filter by group_by)

    df_new %>% 
      mutate(d = 1) %>%
      complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
               fill = list(d =0, big_corp_TF = FALSE)) %>%
      group_by(date, target_nation) %>%
      mutate(n1 = sum(d)) %>%
      group_by(date, target_nation, acquiror_nation) %>%
      summarise(n1 = mean(n1),
                n2 = sum(big_corp_TF), .groups = 'drop') %>%
      group_by(acquiror_nation) %>%
      mutate(share = sum_run(n2, k=2, idx = date)/sum_run(n1, k=2, idx = date))
    
    # A tibble: 14 x 6
    # Groups:   acquiror_nation [2]
        date target_nation acquiror_nation    n1    n2 share
       <dbl> <chr>         <chr>           <dbl> <int> <dbl>
     1  2000 Uganda        France              2     1 0.5  
     2  2000 Uganda        Germany             2     0 0    
     3  2001 Uganda        France              3     1 0.4  
     4  2001 Uganda        Germany             3     0 0    
     5  2002 Uganda        France              3     2 0.5  
     6  2002 Uganda        Germany             3     1 0.167
     7  2003 Uganda        France              2     0 0.4  
     8  2003 Uganda        Germany             2     1 0.4  
     9  2004 Uganda        France              3     1 0.2  
    10  2004 Uganda        Germany             3     1 0.4  
    11  2005 Uganda        France              0     0 0.333
    12  2005 Uganda        Germany             0     0 0.333
    13  2006 Uganda        France              2     2 1    
    14  2006 Uganda        Germany             2     0 0
    

    Further edit

    • It is very easy. Remove, target_nation from nesting and add a group_by on it before complete.

    Simple. Isn't it

    df_new_complex %>%
      mutate(d = 1) %>%
      group_by(target_nation) %>%
      complete(date = seq(min(date), max(date), 1), nesting(acquiror_nation),
               fill = list(d =0, big_corp_TF = FALSE)) %>%
      group_by(date, target_nation) %>%
      mutate(n1 = sum(d)) %>%
      group_by(date, target_nation, acquiror_nation) %>%
      summarise(n1 = mean(n1),
                n2 = sum(big_corp_TF), .groups = 'drop') %>%
      group_by(acquiror_nation) %>%
      mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
    
    # A tibble: 16 x 6
    # Groups:   acquiror_nation [2]
        date target_nation acquiror_nation    n1    n2 share
       <dbl> <chr>         <chr>           <dbl> <int> <dbl>
     1  1999 Mozambique    France              1     0 0    
     2  1999 Mozambique    Germany             1     0 0    
     3  2000 Mozambique    France              0     0 0    
     4  2000 Mozambique    Germany             0     0 0    
     5  2000 Uganda        France              2     1 0.5  
     6  2000 Uganda        Germany             2     0 0    
     7  2001 Mozambique    France              1     1 0.667
     8  2001 Mozambique    Germany             1     0 0    
     9  2001 Uganda        France              3     1 0.5  
    10  2001 Uganda        Germany             3     0 0    
    11  2002 Mozambique    France              2     0 0.2  
    12  2002 Mozambique    Germany             2     1 0.2  
    13  2002 Uganda        France              0     0 0    
    14  2002 Uganda        Germany             0     0 0.5  
    15  2003 Uganda        France              2     0 0    
    16  2003 Uganda        Germany             2     1 0.5