I am trying to clean a dataset which includes data for legacy countries. I want to distribute the values for these to successor new country rows based on the proportion that each of those successor countries makes up of the total. Eg. for each variable in 'Successor country X' I want add to the existing value a portion of the 'Old country' value. I have 30 variables in total. After this, I want to remove the 'Old country' rows.
Here is a simplified version of my dataframe (I actually have multiple old countries):
Country Var1 Var2 Var3 ......Var30
1 Old country 30 18 20
2 Successor country 1 5 1 1
3 Successor country 2 10 1 0
4 Successor country 3 15 1 3
And I'm aiming for....
Country Var1 Var2 Var3 ......Var30
1 Successor country 1 10 7 6
2 Successor country 2 20 7 0
3 Successor country 3 30 7 18
It's the sort of thing I could do quite quickly in Excel but I would really like to learn how to do so in R. I appreciate there may be a few steps required. Many thanks in advance.
A tidyverse solution:
First some data, the old var denotes old/ successor countries. Class denotes their relationship.
library(dplyr)
df <- data.table::fread(
"Country Var1 Var2 Var3 class old
'Old country' 30 18 20 1 1
'Successor country 1' 5 1 1 1 0
'Successor country 2' 10 1 0 1 0
'Successor country 3' 15 1 3 1 0
'Old country 2' 60 36 40 2 1
'Successor country 4' 5 1 1 2 0
'Successor country 5' 10 1 0 2 0
'Successor country 6' 15 1 3 2 0
",
header = TRUE,quote = "'")
We can write a function to do the splitting.
splitit <- function(var, old){
# variable for old country
oldvar = var[which(old == 1)]
# calculate proportion
prop <- var
prop[which(old == 0)] <- var[which(old == 0)]/sum(var[which(old == 0)])
return(prop *oldvar+ var)
}
Then we can apply the function splitit
to all variables starting with "Var", and remove the line for the old country.
df %>%
group_by(class) %>%
mutate_at(vars(starts_with("Var")),.funs = funs(splitit(., old))) %>%
filter (old ==0)
#> # A tibble: 6 x 6
#> # Groups: class [2]
#> Country Var1 Var2 Var3 class old
#> <chr> <dbl> <dbl> <dbl> <int> <int>
#> 1 Successor country 1 10 7 6 1 0
#> 2 Successor country 2 20 7 0 1 0
#> 3 Successor country 3 30 7 18 1 0
#> 4 Successor country 4 15 13 11 2 0
#> 5 Successor country 5 30 13 0 2 0
#> 6 Successor country 6 45 13 33 2 0
Edit: code modified to correctly deal with multiple groups.