I am interested in doing a fairly complicated join and I have not been able to find the answer. Here is an example dataset:
dat2 <- data.frame(age = c(2,2), id = c("T1", "T2"), Height = c(1.1,1.2), Number = c(1,1), node_age = c(0, 0))
dat3 <- data.frame(age = c(3,3,3,3), id = c("T1", "T1", "T2", "T2"), Height = c(1.1,2.2, 1.2, 2.3), Number = c(1,2,1,2), node_age = c(1,0,1,0))
dat4 <- data.frame(age = c(4,4,4,4,4,4), id = c("T1", "T1", "T1", "T2", "T2", "T2"), Height = c(1.1,2.2,3.3,1.2, 2.3,3.4 ), Number = c(1,2,3,1,2,3), node_age = c(2,1,0,2,1,0))
dat_list <- list(dat2, dat3, dat4)
I would like to take this list and join everything together, so that the outcome would look something like this:
$`T1`
id Height Number_2 node_age_2 Number_3 node_age_3 Number_4 node_age_4
1 T1 1.1 1 0 1 1 1 2
2 T1 2.2 NA NA 2 0 2 1
3 T1 3.3 NA NA NA NA 3 0
$T2
id Height Number_2 node_age_2 Number_3 node_age_3 Number_4 node_age_4
4 T2 1.2 1 0 1 1 1 2
5 T2 2.3 NA NA 2 0 2 1
6 T2 3.4 NA NA NA NA 3 0
Where the output is a list of lists by id and "Number" and "node_age" are repeated as functions of "age".
I have made some progress with gather, unite and spread. But I feel that it is inefficient to bind_rows of the list, then spread, then combine back into the list. Maybe I am wrong here?
My true data is a large list (115 elements). When I bind_rows of my list, I end up with 233561 observations. So I have many id's and node_age goes until 115, which is why I am aiming for list structures.
Thanks in advance.
We could bind the rows, then split
by 'id' and dcast
to 'wide' format
library(tidyverse)
library(data.table)
dat_list %>%
bind_rows %>%
split(.$id) %>%
map(~ dcast(as.data.table(.x), id + Height ~ age,
value.var = c( 'Number', 'node_age')))
#$T1
# id Height Number_2 Number_3 Number_4 node_age_2 node_age_3 node_age_4
#1: T1 1.1 1 1 1 0 1 2
#2: T1 2.2 NA 2 2 NA 0 1
#3: T1 3.3 NA NA 3 NA NA 0
#$T2
# id Height Number_2 Number_3 Number_4 node_age_2 node_age_3 node_age_4
#1: T2 1.2 1 1 1 0 1 2
#2: T2 2.3 NA 2 2 NA 0 1
#3: T2 3.4 NA NA 3 NA NA 0
Or we use gather/spread
instead of dcast
dat_list %>%
bind_rows %>%
split(.$id) %>%
map(~ .x %>%
gather(key, val, Number:node_age) %>%
unite(keyage, key, age) %>%
spread(keyage, val))
#$T1
# id Height node_age_2 node_age_3 node_age_4 Number_2 Number_3 Number_4
#1 T1 1.1 0 1 2 1 1 1
#2 T1 2.2 NA 0 1 NA 2 2
#3 T1 3.3 NA NA 0 NA NA 3
#$T2
# id Height node_age_2 node_age_3 node_age_4 Number_2 Number_3 Number_4
#1 T2 1.2 0 1 2 1 1 1
#2 T2 2.3 NA 0 1 NA 2 2
#3 T2 3.4 NA NA 0 NA NA 3
If we need the columns in a particular order
, create the 'key' as a factor
columns with levels
specified in that order to change the order
in spread
dat_list %>%
bind_rows %>%
split(.$id) %>%
map(~ .x %>%
gather(key, val, Number:node_age) %>%
group_by(key) %>%
mutate(rn = row_number()) %>%
ungroup %>%
arrange(rn) %>%
unite(keyage, key, age) %>%
mutate(keyage = factor(keyage, levels = unique(keyage))) %>%
select(-rn) %>%
spread(keyage, val))
#$T1
# A tibble: 3 x 8
# id Height Number_2 node_age_2 Number_3 node_age_3 Number_4 node_age_4
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 T1 1.1 1 0 1 1 1 2
#2 T1 2.2 NA NA 2 0 2 1
#3 T1 3.3 NA NA NA NA 3 0
#$T2
# A tibble: 3 x 8
# id Height Number_2 node_age_2 Number_3 node_age_3 Number_4 node_age_4
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 T2 1.2 1 0 1 1 1 2
#2 T2 2.3 NA NA 2 0 2 1
#3 T2 3.4 NA NA NA NA 3 0