I want to build all possible pairs of rows in a dataframe within each level of a categorical variable name
and then make the differences of these rows within each level of name
for all non-factor variables: row 1 - row 2, row 1 - row 3, …
set.seed(9)
df <- data.frame(
ID = 1:10,
name = as.factor(rep(LETTERS, each = 4)[1:10]),
X1 = sample(1001, 10),
X2 = sample(1001, 10),
bool = sample(c(TRUE, FALSE), 10, replace = TRUE),
fruit = as.factor(sample(c("Apple", "Orange", "Kiwi"), 10, replace = TRUE))
)
This is what the sample looks like:
ID name X1 X2 bool fruit
1 1 A 222 118 FALSE Apple
2 2 A 25 9 TRUE Kiwi
3 3 A 207 883 TRUE Orange
4 4 A 216 301 TRUE Kiwi
5 5 B 443 492 FALSE Apple
6 6 B 134 499 FALSE Kiwi
7 7 B 389 401 TRUE Kiwi
8 8 B 368 972 TRUE Kiwi
9 9 C 665 356 FALSE Apple
10 10 C 985 488 FALSE Kiwi
I want to get a dataframe of 13 rows which looks like :
ID name X1 X2 bool fruit
1 1-2 A 197 109 -1 Apple
2 1-3 A 15 -765 -1 Kiwi
…
Note that the factor fruit
should be unchanged. But it is a bonus, I want above all the X1
and X2
to be changed and the factor name
to be kept.
I know I may use combn
function but I do not see how to do it. I would prefer a solution with the dplyr
package and the group_by
function.
I've managed to create all differences for consecutives rows with dplyr
using
varnotfac <- names(df)[!sapply(df, is.factor )] # remove factorial variable
# but not logical variable
library(dplyr)
diff <- df%>%
group_by(name) %>%
mutate_at(varnotfac, funs(. - lead(.))) %>% #
na.omit()
My sample looks different...
ID name X1 X2 bool 1 1 A 222 118 FALSE 2 2 A 25 9 TRUE 3 3 A 207 883 TRUE 4 4 A 216 301 TRUE 5 5 B 443 492 FALSE 6 6 B 134 499 FALSE 7 7 B 389 401 TRUE 8 8 B 368 972 TRUE 9 9 C 665 356 FALSE 10 10 C 985 488 FALSE
Using this, and looking here, we can do:
library(dplyr)
library(tidyr)
library(purrr)
df %>%
group_by(name) %>%
nest() %>%
mutate(data = map(data, ~as.data.frame(map(.x, ~as.numeric(dist(.)))))) %>%
unnest()
# A tibble: 13 x 5 name ID X1 X2 bool <fct> <dbl> <dbl> <dbl> <dbl> 1 A 1 197 109 1 2 A 2 15 765 1 3 A 3 6 183 1 4 A 1 182 874 0 5 A 2 191 292 0 6 A 1 9 582 0 7 B 1 309 7 0 8 B 2 54 91 1 9 B 3 75 480 1 10 B 1 255 98 1 11 B 2 234 473 1 12 B 1 21 571 0 13 C 1 320 132 0
This is unsigned though. Alternatively:
df %>%
group_by(name) %>%
nest() %>%
mutate(data = map(data, ~as.data.frame(map(.x, ~combn(., 2, diff))))) %>%
unnest()
# A tibble: 13 x 5 name ID X1 X2 bool <fct> <int> <int> <int> <int> 1 A 1 -197 -109 1 2 A 2 -15 765 1 3 A 3 -6 183 1 4 A 1 182 874 0 5 A 2 191 292 0 6 A 1 9 -582 0 7 B 1 -309 7 0 8 B 2 -54 -91 1 9 B 3 -75 480 1 10 B 1 255 -98 1 11 B 2 234 473 1 12 B 1 -21 571 0 13 C 1 320 132 0