I have data from multiple files that have data hourly observations for different variables. I'm trying to combine them to a single dataframe where I have every observation for each variable for the whole time interval. Some variables have data in multiple dataframes for different time intervals. Each dataframe has a time column that can be used for joining.
The problem is that full_join
creates more rows than my data has hours (df1
). Instead I would like to get a dataframe (df2
) without NA
values and extra rows. One solution is to join the dataframes in specific order but I'm hoping for a more general solution that works with larger scale for combining datasets for this type of data.
library(tidyverse)
a <- data.frame(hour = 1:10, x = 5)
b <- data.frame(hour = 1:15, y = 15)
c <- data.frame(hour = 11:20, x = 10)
list <- list(a,b,c)
df1 <- list %>%
reduce(full_join)
df1
hour x y
1 1 5 15
2 2 5 15
3 3 5 15
4 4 5 15
5 5 5 15
6 6 5 15
7 7 5 15
8 8 5 15
9 9 5 15
10 10 5 15
11 11 NA 15
12 12 NA 15
13 13 NA 15
14 14 NA 15
15 15 NA 15
16 11 10 NA
17 12 10 NA
18 13 10 NA
19 14 10 NA
20 15 10 NA
21 16 10 NA
22 17 10 NA
23 18 10 NA
24 19 10 NA
25 20 10 NA
df2 <- full_join(a,c) %>%
full_join(b)
df2
hour x y
1 1 5 15
2 2 5 15
3 3 5 15
4 4 5 15
5 5 5 15
6 6 5 15
7 7 5 15
8 8 5 15
9 9 5 15
10 10 5 15
11 11 10 15
12 12 10 15
13 13 10 15
14 14 10 15
15 15 10 15
16 16 10 NA
17 17 10 NA
18 18 10 NA
19 19 10 NA
20 20 10 NA
EDIT: I got the correct result with tidyverse with following code:
df <- list %>%
reduce(full_join) %>%
group_by(hour) %>%
summarise_all(mean, na.rm = T)
With data.table melt
and dcast
also seem to work
dt1 <- as.data.table(df1)
m <- as.data.table(melt(dt1, id.vars = "hour"))
dc <- dcast.data.table(m, hour~variable, fun.aggregate = mean, na.rm = T)
df <- as.data.frame(dc)
I guess mean
can be replaced with some other function but I couldn't figure which.
Here is a version where the primary key can be a composite (for example hour
and group
). It doesn't work with aggregating like your own solution you added to the question.
## Identify (composite) primary key
mykeys <- c("hour")
## Using the data.table package below
list %>%
lapply(setDT)
## Create a dataset with all unique combinations of the keys to join onto
full <- lapply(list, function(dt) dt[,..mykeys]) %>%
rbindlist %>%
unique
## Can make the keys as actual keys on the full dataset but not necessary
# setkeyv(full,mykeys)
## Join all variables onto the full dataset.
## The preserved values of duplicates are from the last dataset in the order of
## the list object. This step is done by reference.
list %>%
lapply(function(dt) {
n <- names(dt)
full[dt,
on = mykeys,
(n) := mget(paste0("i.", n))]
})
full
Here is an example dataset with a composite primary key and more columns to help identify what is going on. Again, note that the order of the datasets in list
matter.
mykeys <- c("hour"
,"group"
)
a <- data.frame(hour = 1:10,
group = 1,
x = 5, z = 1, a = 1)
b <- data.frame(hour = 1:15,
group = 1,
y = 15, z = 2, b = 1)
c <- data.frame(hour = 11:20,
group = 1,
# group = 2,
x = 10, z = 3, c = 1)