Search code examples
rlistvectormergegtfs

Merge two datasets based on columns that contain vectors


I would like to merge two data sets that contain vectors in columns in R. Here I provide example data for my problem:

df1<- data.frame(place = c("a", "b", "c"))
df1$stops <-  list(c(1,2,3), c(2,3,4), c(3,4,5))

df2 <- data.frame(stops = c(1,2,3,4,5))
df2$trips <- list(c(343,353), c(492, 4332), c(375, 346), c(110, 109), c(267, 268))

The trips should be matched to a place if their stop is on the 'stops' column on list1. My desired result would be a table with 3 rows and 3 columns (place, stops, trips). Each row represents a place and all the stops and trips associated to that place. Row 1 would look like:

 place    stops    trips
1  "a"    1, 2, 3     343, 353, 492, 4332, 375, 346 

Solution

  • Unfortunately, merging/joining doesn't work directly on nested lists like this, so we need to unnest and join, optionally re-nesting.

    dplyr

    library(dplyr)
    library(tidyr) # unnest
    df1 %>%
      unnest(stops) %>%
      left_join(df2, by = "stops") %>%
      summarize(stops = list(stops), trips = list(unlist(trips)), .by = "place") %>%
      as.data.frame()
    #   place   stops                         trips
    # 1     a 1, 2, 3 343, 353, 492, 4332, 375, 346
    # 2     b 2, 3, 4 492, 4332, 375, 346, 110, 109
    # 3     c 3, 4, 5  375, 346, 110, 109, 267, 268
    

    The trailing as.data.frame() is not required, I included it here merely to show the contents a little more transparently, vice dplyr's tendency to "pretty-ify" things (and hide the nested contents).