I have a table with people and the groups they belong to. It is formatted like so:
person_id <- c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2")
year <- c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016)
group_id <- c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
example <- data.frame(person_id, group_id, year)
I want to create a column that for each person shows which other people they share a group with in a particular year. This is what I have currently:
example <- within(example, {
connections <- ave(person_id, group_id, year,
FUN=function(x) paste(x, collapse=', ')
)})
res <- example %>%
group_by(person_id, year) %>%
summarise(joint=paste(connections, collapse=', '))
This is close to what I need, but I do not want to include the person_id in each row. For example, my code creates a new column with the first value of "A1,A2,B1,A1,C2". I would like have this value be "A2,B1,C2". In my example person B1 does not share a group with anyone in 2016. My code produces the row value of "B1", but I would like this cell to be an empty string. How can I achieve this?
Also, the data I am working with is quite large, approximately 1 billion rows. It seems rather inefficient to group by twice, but I am not sure what I want to do is possible without doing so. Is there a better way to approach this?
Note: I cannot use tidyr. Also, my R version is 3.4.3.
code without tidyr
:
library(dplyr)
library(tibble)
example %>%
mutate(p_y = paste(person_id, year, sep = '_'), person_id =NULL, year = NULL) %>%
igraph::graph_from_data_frame() %>%
igraph::components() %>%
igraph::membership() %>%
enframe() %>%
filter(grepl('_', name)) %>%
mutate(person_id = sub("_.*", "", name), year = sub(".*_", "", name))%>%
group_by(year, value) %>%
mutate(value = list(person_id)) %>%
group_by(person_id, year) %>%
summarise(value = toString(setdiff(value[[1]], person_id)), .groups = 'drop')
# A tibble: 12 × 3
person_id year value
<chr> <chr> <chr>
1 A1 2015 "A2, B1, C2"
2 A1 2016 "A2, C2"
3 A2 2015 "A1, B1, C2"
4 A2 2016 "A1, C2"
5 A3 2015 "C1"
6 A3 2016 "C1"
7 B1 2015 "A1, A2, C2"
8 B1 2016 ""
9 C1 2015 "A3"
10 C1 2016 "A3"
11 C2 2015 "A1, A2, B1"
12 C2 2016 "A1, A2"
Edit:
In the case where no transitivity:
example %>%
group_by(year, group_id) %>%
mutate(v = list(person_id)) %>%
group_by(person_id, year) %>%
summarise(v = toString(setdiff(unlist(v), person_id)))
# A tibble: 12 × 3
# Groups: person_id [6]
person_id year v
<chr> <dbl> <chr>
1 A1 2015 "A2, B1, C2"
2 A1 2016 "A2, C2"
3 A2 2015 "A1, B1"
4 A2 2016 "A1"
5 A3 2015 "C1"
6 A3 2016 "C1"
7 B1 2015 "A1, A2"
8 B1 2016 ""
9 C1 2015 "A3"
10 C1 2016 "A3"
11 C2 2015 "A1"
12 C2 2016 "A1"