Search code examples
rdataframedplyrtidyversepanel-data

How to get the difference between groups with a dataframe in long format in R?


Have a simple dataframe with 2 ID's (N = 2) and 2 periods (T = 2), for example:

 year    id    points
   1      1     10
   1      2     12
   2      1     20
   2      2     18

How does one achieves the following dataframe (preferably using dplyr or any tidyverse solution)?

 id    points_difference
  1         10   
  2         6   

Notice that the points_difference column is the difference between each ID in across time (namely T2 - T1).

Additionally, how to generalize for multiple columns and multiple ID (with only 2 periods)?

 year    id    points  scores
   1      1      10      7
   1     ...    ...     ...
   1      N      12      8
   2      1      20      9
   2     ...    ...     ...
   2      N      12      9

 id    points_difference   scores_difference
  1         10                     2
 ...        ...                   ...
  N          0                     1  

Solution

  • If you are on dplyr 1.0.0(or higher), summarise can return multiple rows in output so this will also work if you have more than 2 periods. You can do :

    library(dplyr)
    
    df %>%
      arrange(id, year) %>%
      group_by(id) %>%
      summarise(across(c(points, scores), diff, .names = '{col}_difference'))
    
    #     id points_difference scores_difference
    #  <int>             <int>             <int>
    #1     1                10                 2
    #2     1                -7                 1
    #3     2                 6                 2
    #4     2                -3                 3
    

    data

    df <- structure(list(year = c(1L, 1L, 2L, 2L, 3L, 3L), id = c(1L, 2L, 
    1L, 2L, 1L, 2L), points = c(10L, 12L, 20L, 18L, 13L, 15L), scores = c(2L, 
    3L, 4L, 5L, 5L, 8L)), class = "data.frame", row.names = c(NA, -6L))