Search code examples
rmergereshapereduce

Merge dataframes and fill gaps with duplicated columns and corresponding dates


I have three dataframes (df1, df2 and df3). I would like to merge them and also, fill each others gaps. For example: df1 contains USA data from 1990 to 1993 and df2 from 1994 to 1999. When I merge them using reduce function, I get duplicate data (USA.x, USA.y) instead of just USA with all consecutive dates and values.

I did as follows:

df1 <- data.frame(
  stringsAsFactors = FALSE, 
  Date = c("01/01/1990", "01/01/1991", 
           "01/01/1992", "01/01/1993", "01/01/1994", "01/01/1995"), 
  USA = c(1, 4, 2, 1, NA, NA), 
  FRANCE = c(4, 4, 2, 5, NA, NA), 
  ITALY = c(1, 4, 5, 2, NA, NA))

df2 <-data.frame(
  stringsAsFactors = FALSE, 
  Date = c("01/01/1994", "01/01/1995", 
           "01/01/1996", "01/01/1997", "01/01/1998", "01/01/1999"), 
  USA = c(3, 3, 1, 4, 3, 1), 
  FRANCE = c(2, 5, 2, 5, 5, 1), 
  MEXICO = c(4, 1, 4, 3, NA, NA))

df3 <- data.frame(
  stringsAsFactors = FALSE, 
  Date = c("01/01/1998", "01/01/1999", 
           "01/01/2000", "01/01/2001", "01/01/2002", "01/01/2003"), 
  MEXICO = c(3, 3, 5, 4, 2, 3), 
  BELGIUM = c(4, 2, 1, 4, 5, 1))


df_list <- list(df1, df2, df3)

Option 1

dfall1 <- Reduce(function(x, y) merge(x, y, all=TRUE), df_list, accumulate=FALSE)
View(dfall1)

Results dfall1

It integrates columns according to duplicated names but it duplicates the dates. There will be only one column for each contry but 2 rows of the same date: one with NA, the other with a value from another data frame.

Option 2

dfall2 <- Reduce(function(x, y) merge(x, y, all=TRUE, by = "Date"), df_list, accumulate=FALSE)
View(dfall2)

Results dfall2

Date rows are not repeated by column names will be renamed as .x and .y.

Question How can I avoid duplicated row and columns. I wish for all data to be integrated according to their column name and corresponding dates

I would like to end up with the following:

desired resuts

Any help would be truly appreciated. Thanks in advance.


Solution

  • Reshaping, e.g. using reshape2 package. First, reshape into long format, next rbind and na.omit, finally reshape wide.

    > lapply(df_list, reshape2:::melt.data.frame, id.vars='Date') |> 
    +   do.call(what='rbind') |> na.omit() |> reshape2::dcast(Date ~ variable) 
             Date USA FRANCE ITALY MEXICO BELGIUM
    1  01/01/1990   1      4     1     NA      NA
    2  01/01/1991   4      4     4     NA      NA
    3  01/01/1992   2      2     5     NA      NA
    4  01/01/1993   1      5     2     NA      NA
    5  01/01/1994   3      2    NA      4      NA
    6  01/01/1995   3      5    NA      1      NA
    7  01/01/1996   1      2    NA      4      NA
    8  01/01/1997   4      5    NA      3      NA
    9  01/01/1998   3      5    NA      3       4
    10 01/01/1999   1      1    NA      3       2
    11 01/01/2000  NA     NA    NA      5       1
    12 01/01/2001  NA     NA    NA      4       4
    13 01/01/2002  NA     NA    NA      2       5
    14 01/01/2003  NA     NA    NA      3       1
    

    Data:

    > dput(df_list)
    list(structure(list(Date = c("01/01/1990", "01/01/1991", "01/01/1992", 
    "01/01/1993", "01/01/1994", "01/01/1995"), USA = c(1, 4, 2, 1, 
    NA, NA), FRANCE = c(4, 4, 2, 5, NA, NA), ITALY = c(1, 4, 5, 2, 
    NA, NA)), class = "data.frame", row.names = c(NA, -6L)), structure(list(
        Date = c("01/01/1994", "01/01/1995", "01/01/1996", "01/01/1997", 
        "01/01/1998", "01/01/1999"), USA = c(3, 3, 1, 4, 3, 1), FRANCE = c(2, 
        5, 2, 5, 5, 1), MEXICO = c(4, 1, 4, 3, NA, NA)), class = "data.frame", row.names = c(NA, 
    -6L)), structure(list(Date = c("01/01/1998", "01/01/1999", "01/01/2000", 
    "01/01/2001", "01/01/2002", "01/01/2003"), MEXICO = c(3, 3, 5, 
    4, 2, 3), BELGIUM = c(4, 2, 1, 4, 5, 1)), class = "data.frame", row.names = c(NA, 
    -6L)))