Search code examples
rdplyrcombn

Differences between all possible pairs of rows for all columns within each level of factor


I want to build all possible pairs of rows in a dataframe within each level of a categorical variable name and then make the differences of these rows within each level of name for all non-factor variables: row 1 - row 2, row 1 - row 3, …

set.seed(9)
df <- data.frame(
  ID = 1:10,
  name = as.factor(rep(LETTERS, each = 4)[1:10]),
  X1 = sample(1001, 10),
  X2 = sample(1001, 10),
  bool = sample(c(TRUE, FALSE), 10, replace = TRUE),
  fruit = as.factor(sample(c("Apple", "Orange", "Kiwi"), 10, replace = TRUE))
)

This is what the sample looks like:

   ID name  X1  X2  bool  fruit
1   1    A 222 118 FALSE  Apple
2   2    A  25   9  TRUE   Kiwi
3   3    A 207 883  TRUE Orange
4   4    A 216 301  TRUE   Kiwi
5   5    B 443 492 FALSE  Apple
6   6    B 134 499 FALSE   Kiwi
7   7    B 389 401  TRUE   Kiwi
8   8    B 368 972  TRUE   Kiwi
9   9    C 665 356 FALSE  Apple
10 10    C 985 488 FALSE   Kiwi

I want to get a dataframe of 13 rows which looks like :

   ID  name  X1   X2  bool  fruit
1  1-2    A 197  109    -1  Apple
2  1-3    A  15 -765    -1   Kiwi
…

Note that the factor fruit should be unchanged. But it is a bonus, I want above all the X1 and X2 to be changed and the factor name to be kept.

I know I may use combn function but I do not see how to do it. I would prefer a solution with the dplyr package and the group_by function.

I've managed to create all differences for consecutives rows with dplyr using

varnotfac <- names(df)[!sapply(df, is.factor )] # remove factorial variable
# but not logical variable

library(dplyr)
diff <- df%>%
  group_by(name) %>%
  mutate_at(varnotfac, funs(. - lead(.))) %>% #      
  na.omit() 

Solution

  • My sample looks different...

       ID name  X1  X2  bool
    1   1    A 222 118 FALSE
    2   2    A  25   9  TRUE
    3   3    A 207 883  TRUE
    4   4    A 216 301  TRUE
    5   5    B 443 492 FALSE
    6   6    B 134 499 FALSE
    7   7    B 389 401  TRUE
    8   8    B 368 972  TRUE
    9   9    C 665 356 FALSE
    10 10    C 985 488 FALSE
    

    Using this, and looking here, we can do:

    library(dplyr)
    library(tidyr)
    library(purrr)
    
    df %>% 
      group_by(name) %>% 
      nest() %>% 
      mutate(data = map(data, ~as.data.frame(map(.x, ~as.numeric(dist(.)))))) %>% 
      unnest()
    
    # A tibble: 13 x 5
       name     ID    X1    X2  bool
       <fct> <dbl> <dbl> <dbl> <dbl>
     1 A         1   197   109     1
     2 A         2    15   765     1
     3 A         3     6   183     1
     4 A         1   182   874     0
     5 A         2   191   292     0
     6 A         1     9   582     0
     7 B         1   309     7     0
     8 B         2    54    91     1
     9 B         3    75   480     1
    10 B         1   255    98     1
    11 B         2   234   473     1
    12 B         1    21   571     0
    13 C         1   320   132     0
    

    This is unsigned though. Alternatively:

    df %>% 
      group_by(name) %>% 
      nest() %>% 
      mutate(data = map(data, ~as.data.frame(map(.x, ~combn(., 2, diff))))) %>% 
      unnest()
    
    # A tibble: 13 x 5
       name     ID    X1    X2  bool
       <fct> <int> <int> <int> <int>
     1 A         1  -197  -109     1
     2 A         2   -15   765     1
     3 A         3    -6   183     1
     4 A         1   182   874     0
     5 A         2   191   292     0
     6 A         1     9  -582     0
     7 B         1  -309     7     0
     8 B         2   -54   -91     1
     9 B         3   -75   480     1
    10 B         1   255   -98     1
    11 B         2   234   473     1
    12 B         1   -21   571     0
    13 C         1   320   132     0