Search code examples
rjoindata-bindingbinddplyr

Combine two dataframes that have common rows and columns (fill in)


I have a big dataframe that I want to fill in with results from SQL queries to many databases, so kind of "filling in data cubbyholes" so to speak. Wrinkle: I don't know how many cubbyholes will be filled in (there's a group_by year, so I might get a dataframe with one year, or many).

I'm having difficulty figuring out how to accomplish this. I'm trying to use the dplyr package..

  • left_join either adds the same row twice (if I specify the by=), or drops the new column (if I don't specify the by= and thus it joins on both similar columns)enter image description here

    enter image description here

  • bind_cols doesn't work

  • bind_rows adds a duplicate row.

    enter image description here

How can I get the new data to fill in the cubbyhole itself? (btw, I'm not married to dplyr... I just don't want to iterate through every element of the new dataframe)

Code follows:

library(dplyr)
TargetDF <- structure(list(Ind = c(5, 6, 7), `2015 Act` = c(7870L, NA, NA
                                                            )), .Names = c("Ind", "2015 Act"), class = c("tbl_df", "data.frame"
                                                                                                         ), row.names = c(NA, -3L))

tempDF <- structure(list(Ind = 6, `2015 Act` = 49782L, `2016 Act` = 323L), .Names = c("Ind", 
                                                                                      "2015 Act", "2016 Act"), class = c("tbl_df", "tbl", "data.frame"
                                                                                      ), row.names = c(NA, -1L))
left_join(TargetDF,tempDF, by= "Ind")
## gives duplicate columns

left_join(TargetDF,tempDF)
## loses the new "2015 Act" data for Ind 6

bind_cols(TargetDF,tempDF)
## don't work

bind_rows(TargetDF,tempDF)
## double Ind 6 (there are other columns nor included here, which is why I can't !is.na() to eliminate duplicate Ind 6)

Solution

  • One possible way is to get the non-NA values out of each column grouped by Ind, and otherwise, leave (generate) an NA

    full_join(TargetDF, tempDF) %>% 
      group_by(Ind) %>% 
      summarise_each(funs(.[!is.na(.)][1L]))
    
    # Source: local data frame [3 x 3]
    # 
    #     Ind 2015 Act 2016 Act
    #   (dbl)    (int)    (int)
    # 1     5     7870       NA
    # 2     6    49782      323
    # 3     7       NA       NA