Search code examples
rdataframedplyrleft-join

how to use multiple left_joins without leftover NAS


I am working with a handful of dataframes whose information I would like to combine.

df1 <- data.frame(id = c(1, 1, 2, 2), year = c(1990, 2000, 1990, 2000))
df1990 <- data.frame(id = c(1, 2), year = c(1990, 1990), X= c(1, 2))
df2000 <- data.frame(id = c(1, 2), year = c(2000, 2000), X= c(1, 2))

Above is code for creating toy inputs. I want to append the information in df2 and df3 to df1, resulting in a dataframe like this

df <- data.frame(id = c(1, 1, 2, 2), year = c(1990, 2000, 1990, 2000), X = c(1, 1, 2, 2))

To do this, my first thought was to use left_join() but I can only do this successfully once -- it works with the first attempted merge, but the NAs remain NAs when I try to do a second merge. So I run:

df <- left_join(df1, df1990)
df <- left_join(df, df2000)

But I still have NAs. Any idea how I can fix this?


Solution

  • As suggested by TarJae in comments, create the right-hand side of the join, then join only once.

    Assuming your dfs are all named dfYEAR, you can pull them out of the workspace

    year_dfs <- lapply(grep("df[0-9]{4}", ls(), value = TRUE), get)
    

    (even better would be to load them into a list to begin with)

    Then join the two tables

    # base-r
    merge(df1, do.call(rbind, year_dfs))
    
    # dplyr
    year_dfs |> bind_rows() |> right_join(df1)
    

    Note that if you have non-unique combinations of id and year in the year_dfs, you will end up with more rows than df1 started with.