Search code examples
rdataframejoinouter-join

Merge dataframes without duplicate rows with NA values


I have data from multiple files that have data hourly observations for different variables. I'm trying to combine them to a single dataframe where I have every observation for each variable for the whole time interval. Some variables have data in multiple dataframes for different time intervals. Each dataframe has a time column that can be used for joining.

The problem is that full_join creates more rows than my data has hours (df1). Instead I would like to get a dataframe (df2) without NA values and extra rows. One solution is to join the dataframes in specific order but I'm hoping for a more general solution that works with larger scale for combining datasets for this type of data.

library(tidyverse)
a <- data.frame(hour = 1:10, x = 5)
b <- data.frame(hour = 1:15, y = 15)
c <- data.frame(hour = 11:20, x = 10)

list <- list(a,b,c)

df1 <- list %>%
  reduce(full_join)

df1
   hour  x  y
1     1  5 15
2     2  5 15
3     3  5 15
4     4  5 15
5     5  5 15
6     6  5 15
7     7  5 15
8     8  5 15
9     9  5 15
10   10  5 15
11   11 NA 15
12   12 NA 15
13   13 NA 15
14   14 NA 15
15   15 NA 15
16   11 10 NA
17   12 10 NA
18   13 10 NA
19   14 10 NA
20   15 10 NA
21   16 10 NA
22   17 10 NA
23   18 10 NA
24   19 10 NA
25   20 10 NA

df2 <- full_join(a,c) %>%
  full_join(b)

df2
   hour  x  y
1     1  5 15
2     2  5 15
3     3  5 15
4     4  5 15
5     5  5 15
6     6  5 15
7     7  5 15
8     8  5 15
9     9  5 15
10   10  5 15
11   11 10 15
12   12 10 15
13   13 10 15
14   14 10 15
15   15 10 15
16   16 10 NA
17   17 10 NA
18   18 10 NA
19   19 10 NA
20   20 10 NA


EDIT: I got the correct result with tidyverse with following code:

df <- list %>%
  reduce(full_join) %>%
  group_by(hour) %>%
  summarise_all(mean, na.rm = T)

With data.table meltand dcast also seem to work

dt1 <- as.data.table(df1)
m <- as.data.table(melt(dt1, id.vars = "hour"))
dc <- dcast.data.table(m, hour~variable, fun.aggregate = mean, na.rm = T)
df <- as.data.frame(dc)

I guess mean can be replaced with some other function but I couldn't figure which.


Solution

  • Here is a version where the primary key can be a composite (for example hour and group). It doesn't work with aggregating like your own solution you added to the question.

    ## Identify (composite) primary key 
    mykeys <- c("hour")
    
    ## Using the data.table package below
    list %>%
        lapply(setDT)
    
    ## Create a dataset with all unique combinations of the keys to join onto
    full <- lapply(list, function(dt) dt[,..mykeys]) %>%
        rbindlist %>%
        unique
    
    ## Can make the keys as actual keys on the full dataset but not necessary
    # setkeyv(full,mykeys)
    
    ## Join all variables onto the full dataset.
    ## The preserved values of duplicates are from the last dataset in the order of 
    ## the list object. This step is done by reference.
    list %>%
        lapply(function(dt) {
            n <- names(dt)
            full[dt,
                 on = mykeys,
                 (n) := mget(paste0("i.", n))]
        })
    full
    

    Here is an example dataset with a composite primary key and more columns to help identify what is going on. Again, note that the order of the datasets in list matter.

    mykeys <- c("hour"
                ,"group"
    )
    
    a <- data.frame(hour = 1:10, 
                    group = 1,
                    x = 5, z = 1, a = 1)
    b <- data.frame(hour = 1:15, 
                    group = 1,
                    y = 15, z = 2, b = 1)
    c <- data.frame(hour = 11:20, 
                    group = 1,
                    # group = 2,
                    x = 10, z = 3, c = 1)