I have a big data frame (900k rows) about mergers and acquisitions (M&As).
The df has four columns: date (when the M&A was completed), target_nation (a company of which country was merged/acquired), acquiror_nation (corporation of which country was the acquiror), and big_corp (whether the acquiror was a big corporation or not, where TRUE means that corporation is big).
Here is a sample of my df:
> df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L,
2002L, 2002L), target_nation = c("Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "France", "France",
"Germany"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE)), row.names = c(NA, -8L))
> df
date target_nation acquiror_nation big_corp_TF
1: 2000 Uganda France TRUE
2: 2000 Uganda Germany FALSE
3: 2001 Uganda France TRUE
4: 2001 Uganda France FALSE
5: 2001 Uganda Germany FALSE
6: 2002 Uganda France TRUE
7: 2002 Uganda France TRUE
8: 2002 Uganda Germany TRUE
From these data, I want to create a new variable that denotes the share of M&As done by big corporations of specific acquiror nations, counting the average for 2 years. (For my actual exercise, I will count the averages for 5 years, but let's keep things simpler here). So there would be a new variable for France's big corporations, and a new variable for Germany's big corporations.
What I have managed to do so far is to 1) count the total number of M&As in a specific target_nation for a certain year; and 2) count the total number of M&As conducted by a big corporation of a certain acquiror_nation in a specific target_nation in a certain year. I have joined these two dfs to facilitate the calculation of the averages I want. Here's the code I've used and the resulting new df:
##counting total rows for target nations
df2 <- df %>%
group_by(date, target_nation) %>%
count(target_nation)
##counting total rows conducted by small or big corps for certain acquiror nations
df3 <- df %>%
group_by(date, target_nation, acquiror_nation) %>%
count(big_corp_TF)
##selecting rows that were conducted by big corps
df33 <- df3 %>%
filter(big_corp_TF == TRUE)
##merging df2 and df33
df4 <- df2 %>%
left_join(df33, by = c("date" = "date", "target_nation" = "target_nation"))
df4 <- as.data.frame(df4)
> df4
date target_nation n.x acquiror_nation big_corp_TF n.y
1 2000 Uganda 2 France TRUE 1
2 2001 Uganda 3 France TRUE 1
3 2002 Uganda 3 France TRUE 2
4 2002 Uganda 3 Germany TRUE 1
n.x here is the total number of M&As (rows) for a specific target_nation in a certain year; n.y is the total number of M&As (rows) conducted by big corporations of specific acquiror_nations in a certain target_nation.
With this new data frame df4 I could now easily calculate the share of M&As conducted by big corporations of a specific acquiror_nation in a certain target_nation in a certain single year. For example, let's count this share for France:
df5 <- df4 %>%
filter(acquiror_nation == "France") %>%
mutate(France_bigcorp_share_1year = n.y / n.x)
date target_nation n.x acquiror_nation big_corp_TF n.y France_bigcorp_share_1year
1 2000 Uganda 2 France TRUE 1 0.5000000
2 2001 Uganda 3 France TRUE 1 0.3333333
3 2002 Uganda 3 France TRUE 2 0.6666667
However, I cannot figure out how to calculate the share of M&As done by big corporations of specific acquiror nations, counting the average for 2 years.
This is what a desired variable would look like:
date target_nation n.x acquiror_nation big_corp_TF n.y France_bigcorp_share_2years
1 2000 Uganda 2 France TRUE 1 0.5000000
2 2001 Uganda 3 France TRUE 1 0.4000000
3 2002 Uganda 3 France TRUE 2 0.5000000
Note that the share for 2000 would remain the same, because there is no prior year to make it a 2 year average; for 2001 it would become 0.4 (because (1+1)/(2+3) = 0.4); for 2002 it would become 0.5 (because (1+2)/(3+3) = 0.5).
Do you have an idea on how to write a code that would calculate the average share for two years? I suppose I would need to use a for loop here, but I can't figure out how. Any suggestions would be appreciated.
--
EDIT: AnilGoyal's code works perfectly with the sample data, but my actual data is apparently more messier and I thus wonder if there is a solution to the problem I encounter.
My actual data set sometimes skips a year, or sometimes doesn't include the acquiror_nations that were included in previous rows. Please see a more accurate sample of my actual data:
> df_new <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L,
2002L, 2002L, 2003L, 2003L, 2004L, 2004L, 2004L, 2006L, 2006L
), target_nation = c("Uganda", "Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "France", "France",
"Germany", "Germany", "Germany", "France", "France", "Germany",
"France", "France"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)), row.names = c(NA,
-15L))
> df_new
date target_nation acquiror_nation big_corp_TF
1: 2000 Uganda France TRUE
2: 2000 Uganda Germany FALSE
3: 2001 Uganda France TRUE
4: 2001 Uganda France FALSE
5: 2001 Uganda Germany FALSE
6: 2002 Uganda France TRUE
7: 2002 Uganda France TRUE
8: 2002 Uganda Germany TRUE
9: 2003 Uganda Germany TRUE
10: 2003 Uganda Germany FALSE
11: 2004 Uganda France TRUE
12: 2004 Uganda France FALSE
13: 2004 Uganda Germany TRUE
14: 2006 Uganda France TRUE
15: 2006 Uganda France TRUE
NB: There are no rows for France in 2003; and there is no year 2005.
If I run Anil's first code, the result is the following tibble:
date target_nation acquiror_nation n1 n2 share
<int> <chr> <chr> <dbl> <int> <dbl>
1 2000 Uganda France 2 1 0.5
2 2001 Uganda France 3 1 0.4
3 2002 Uganda France 3 2 0.5
4 2004 Uganda France 3 1 0.5
5 2006 Uganda France 2 2 0.6
NB: there is no result for France for 2003 and 2005; I would like there to be results for 2003 and 2005 (because we are calculating 2-year averages and thus we should be able to have results for 2003 and 2005). Also, the share for 2006 is incorrect in reality, because it should be 1 (it should take the values of 2005 (which are 0s) rather than the values of 2004 for the calculation of average).
I would like to be able to receive the following tibble:
date target_nation acquiror_nation n1 n2 share
<int> <chr> <chr> <dbl> <int> <dbl>
1 2000 Uganda France 2 1 0.5
2 2001 Uganda France 3 1 0.4
3 2002 Uganda France 3 2 0.5
4 2003 Uganda France 2 0 0.4
5 2004 Uganda France 3 1 0.2
6 2005 Uganda France 0 0 0.33
7 2006 Uganda France 2 2 1.0
NB: notice that the result for 2006 is also different (because we now take 2005 instead of 2004 for a two-year average).
Do you think it's possible to find a way to output the desired tibble? I understand that this is a problem with the original data: it simply lacks certain data points. However, including them to the original data set seems to be highly inconvenient; it is probably better to include them mid-way, e.g. after counting the n1 and n2. But what is the most convenient way to do this?
EDIT2: Anil's new code works well with the data sample above, but it runs into an undesired issue when dealing with a more complex data sample (that includes more than one target_nation). Here is a shorter but more complex data sample:
> df_new_complex <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2003L,
2003L, 1999L, 2001L, 2002L, 2002L), target_nation = c("Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Mozambique",
"Mozambique", "Mozambique", "Mozambique"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "Germany", "Germany",
"Germany", "France", "France", "Germany"), big_corp_TF = c(TRUE,
FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE
)), row.names = c(NA, -11L))
> df_new_complex
date target_nation acquiror_nation big_corp_TF
1: 2000 Uganda France TRUE
2: 2000 Uganda Germany FALSE
3: 2001 Uganda France TRUE
4: 2001 Uganda France FALSE
5: 2001 Uganda Germany FALSE
6: 2003 Uganda Germany TRUE
7: 2003 Uganda Germany FALSE
8: 1999 Mozambique Germany FALSE
9: 2001 Mozambique France TRUE
10: 2002 Mozambique France FALSE
11: 2002 Mozambique Germany TRUE
As you see, this data sample includes two target_nations. Anil's code, where param <- c("France", "Germany")
, produces the following tibble:
date target_nation acquiror_nation n1 n2 share
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 1999 Mozambique France 1 0 0
2 1999 Mozambique Germany 1 0 0
3 1999 Uganda France 0 0 0
4 1999 Uganda Germany 0 0 0
5 2000 Mozambique France 0 0 0
6 2000 Mozambique Germany 0 0 0
7 2000 Uganda France 2 1 0.25
8 2000 Uganda Germany 2 0 0.167
9 2001 Mozambique France 1 1 0.4
10 2001 Mozambique Germany 1 0 0.333
11 2001 Uganda France 3 1 0.333
12 2001 Uganda Germany 3 0 0.25
13 2002 Mozambique France 2 0 0.2
14 2002 Mozambique Germany 2 1 0.25
15 2002 Uganda France 0 0 0.25
16 2002 Uganda Germany 0 0 0.25
17 2003 Mozambique France 0 0 0.25
18 2003 Mozambique Germany 0 0 0.25
19 2003 Uganda France 2 0 0.167
20 2003 Uganda Germany 2 1 0.25
What's undesired here is the fact that the code creates a year 1999 for Uganda, and a year 2003 for Mozambique (the latter is less of an issue). In year 1999, Uganda had no investments as shown in the data sample, so it doesn't make sense to have numeric values for that (it could have NAs, or not be there at all). Mozambique also didn't have investments in year 2003, so I don't want to calculate the shares for that year for Mozambique.
I have found a workaround for this, whereby I filter for a specific target nation early in the code, just like so:
correct1 <- df_new_complex %>%
filter(target_nation == "Mozambique") %>%
mutate(d = 1) %>% ...
#I do the same for another target_nation
correct2 <- df_new_complex %>%
filter(target_nation == "Uganda") %>%
mutate(d = 1) %>% ...
#I then use rbind
correct <- rbind(correct1, correct2)
#which produces the desired tibble (without a year 2003 for Mozambique and 1999 for Uganda).
> correct
date target_nation acquiror_nation n1 n2 share
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 1999 Mozambique France 1 0 0
2 1999 Mozambique Germany 1 0 0
3 2000 Mozambique France 0 0 0
4 2000 Mozambique Germany 0 0 0
5 2001 Mozambique France 1 1 1
6 2001 Mozambique Germany 1 0 0
7 2002 Mozambique France 2 0 0.33
8 2002 Mozambique Germany 2 1 0.333
9 2000 Uganda France 2 1 0.5
10 2000 Uganda Germany 2 0 0.25
11 2001 Uganda France 3 1 0.286
12 2001 Uganda Germany 3 0 0.2
13 2002 Uganda France 0 0 0.167
14 2002 Uganda Germany 0 0 0.167
15 2003 Uganda France 2 0 0
16 2003 Uganda Germany 2 1 0.25
What would be a quicker way to do this? I have a list of desired target_nations. Perhaps it would be possible to create a loop, whereby I do calculations for one target_nation, then another one; then rbind them; then another one; then rbind, etc. Or is there a better way?
With package runner
you can do something like this
df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L,
2002L, 2002L), target_nation = c("Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "France", "France",
"Germany"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE)), row.names = c(NA, -8L))
library(runner)
library(tidyverse)
df <- df %>% as.data.frame()
param <- 'France'
df %>%
group_by(date, target_nation) %>%
mutate(n1 = n()) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
filter(acquiror_nation == param) %>%
mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
#> # A tibble: 3 x 6
#> date target_nation acquiror_nation n1 n2 share
#> <int> <chr> <chr> <dbl> <int> <dbl>
#> 1 2000 Uganda France 2 1 0.5
#> 2 2001 Uganda France 3 1 0.4
#> 3 2002 Uganda France 3 2 0.5
Even you can do for all nations simultaneously
df %>%
group_by(date, target_nation) %>%
mutate(n1 = n()) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
group_by(acquiror_nation) %>%
mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
#> # A tibble: 6 x 6
#> # Groups: acquiror_nation [2]
#> date target_nation acquiror_nation n1 n2 share
#> <int> <chr> <chr> <dbl> <int> <dbl>
#> 1 2000 Uganda France 2 1 0.5
#> 2 2000 Uganda Germany 2 0 0
#> 3 2001 Uganda France 3 1 0.4
#> 4 2001 Uganda Germany 3 0 0
#> 5 2002 Uganda France 3 2 0.5
#> 6 2002 Uganda Germany 3 1 0.167
In view of revised scenario, you need to do 2 things -
idx = date
in both sum_run
functions. This would correct the output as desired but won't include share for missing rows/years.tidyr::complete
as shown below-param <- 'France'
df_new %>%
mutate(d = 1) %>%
complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
fill = list(d =0, big_corp_TF = FALSE)) %>%
group_by(date, target_nation) %>%
mutate(n1 = sum(d)) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
filter(acquiror_nation == param) %>%
mutate(share = sum_run(n2, k=2, idx = date)/sum_run(n1, k=2, idx = date))
# A tibble: 7 x 6
date target_nation acquiror_nation n1 n2 share
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 2000 Uganda France 2 1 0.5
2 2001 Uganda France 3 1 0.4
3 2002 Uganda France 3 2 0.5
4 2003 Uganda France 2 0 0.4
5 2004 Uganda France 3 1 0.2
6 2005 Uganda France 0 0 0.333
7 2006 Uganda France 2 2 1
Similar to above, you can do it for all nations at once (replcae filter by group_by)
df_new %>%
mutate(d = 1) %>%
complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
fill = list(d =0, big_corp_TF = FALSE)) %>%
group_by(date, target_nation) %>%
mutate(n1 = sum(d)) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
group_by(acquiror_nation) %>%
mutate(share = sum_run(n2, k=2, idx = date)/sum_run(n1, k=2, idx = date))
# A tibble: 14 x 6
# Groups: acquiror_nation [2]
date target_nation acquiror_nation n1 n2 share
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 2000 Uganda France 2 1 0.5
2 2000 Uganda Germany 2 0 0
3 2001 Uganda France 3 1 0.4
4 2001 Uganda Germany 3 0 0
5 2002 Uganda France 3 2 0.5
6 2002 Uganda Germany 3 1 0.167
7 2003 Uganda France 2 0 0.4
8 2003 Uganda Germany 2 1 0.4
9 2004 Uganda France 3 1 0.2
10 2004 Uganda Germany 3 1 0.4
11 2005 Uganda France 0 0 0.333
12 2005 Uganda Germany 0 0 0.333
13 2006 Uganda France 2 2 1
14 2006 Uganda Germany 2 0 0
Further edit
target_nation
from nesting
and add a group_by
on it before complete
.Simple. Isn't it
df_new_complex %>%
mutate(d = 1) %>%
group_by(target_nation) %>%
complete(date = seq(min(date), max(date), 1), nesting(acquiror_nation),
fill = list(d =0, big_corp_TF = FALSE)) %>%
group_by(date, target_nation) %>%
mutate(n1 = sum(d)) %>%
group_by(date, target_nation, acquiror_nation) %>%
summarise(n1 = mean(n1),
n2 = sum(big_corp_TF), .groups = 'drop') %>%
group_by(acquiror_nation) %>%
mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
# A tibble: 16 x 6
# Groups: acquiror_nation [2]
date target_nation acquiror_nation n1 n2 share
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 1999 Mozambique France 1 0 0
2 1999 Mozambique Germany 1 0 0
3 2000 Mozambique France 0 0 0
4 2000 Mozambique Germany 0 0 0
5 2000 Uganda France 2 1 0.5
6 2000 Uganda Germany 2 0 0
7 2001 Mozambique France 1 1 0.667
8 2001 Mozambique Germany 1 0 0
9 2001 Uganda France 3 1 0.5
10 2001 Uganda Germany 3 0 0
11 2002 Mozambique France 2 0 0.2
12 2002 Mozambique Germany 2 1 0.2
13 2002 Uganda France 0 0 0
14 2002 Uganda Germany 0 0 0.5
15 2003 Uganda France 2 0 0
16 2003 Uganda Germany 2 1 0.5