Search code examples
rdataframerow

Cleaning legacy country records in R dataframe


I am trying to clean a dataset which includes data for legacy countries. I want to distribute the values for these to successor new country rows based on the proportion that each of those successor countries makes up of the total. Eg. for each variable in 'Successor country X' I want add to the existing value a portion of the 'Old country' value. I have 30 variables in total. After this, I want to remove the 'Old country' rows.

Here is a simplified version of my dataframe (I actually have multiple old countries):

Country                   Var1    Var2   Var3 ......Var30
1   Old country           30      18     20       
2   Successor country 1   5       1      1
3   Successor country 2   10      1      0
4   Successor country 3   15      1      3

And I'm aiming for....

Country                   Var1    Var2   Var3 ......Var30
1   Successor country 1   10      7      6
2   Successor country 2   20      7      0
3   Successor country 3   30      7      18

It's the sort of thing I could do quite quickly in Excel but I would really like to learn how to do so in R. I appreciate there may be a few steps required. Many thanks in advance.


Solution

  • A tidyverse solution:

    First some data, the old var denotes old/ successor countries. Class denotes their relationship.

    library(dplyr)
    df <- data.table::fread( 
    "Country       Var1    Var2   Var3 class  old 
    'Old country'           30      18     20  1  1    
    'Successor country 1'   5       1      1   1  0
    'Successor country 2'   10      1      0   1  0
    'Successor country 3'   15      1      3   1  0
    'Old country 2'         60      36     40  2  1    
    'Successor country 4'   5       1      1   2  0
    'Successor country 5'   10      1      0   2  0
    'Successor country 6'   15      1      3   2  0
    ",
    header = TRUE,quote = "'")
    

    We can write a function to do the splitting.

    splitit <- function(var, old){
      # variable for old country
      oldvar = var[which(old == 1)]
      # calculate proportion
      prop <- var
      prop[which(old == 0)] <- var[which(old == 0)]/sum(var[which(old == 0)]) 
      return(prop *oldvar+ var)
    }
    

    Then we can apply the function splitit to all variables starting with "Var", and remove the line for the old country.

    df %>%
      group_by(class) %>%
      mutate_at(vars(starts_with("Var")),.funs = funs(splitit(., old))) %>% 
      filter (old ==0)
    #> # A tibble: 6 x 6
    #> # Groups:   class [2]
    #>   Country              Var1  Var2  Var3 class   old
    #>   <chr>               <dbl> <dbl> <dbl> <int> <int>
    #> 1 Successor country 1    10     7     6     1     0
    #> 2 Successor country 2    20     7     0     1     0
    #> 3 Successor country 3    30     7    18     1     0
    #> 4 Successor country 4    15    13    11     2     0
    #> 5 Successor country 5    30    13     0     2     0
    #> 6 Successor country 6    45    13    33     2     0
    

    Edit: code modified to correctly deal with multiple groups.