R - combine vectors in a data frame, filling gaps in dates


I have monthly observed and modeled data organized as vectors:

obs <- structure(c(68.72228685137, 68.4565130874024, 68.3237563140977, 
66.1789683147099, 63.7162100107148, 59.9698454002755), .Names = c("X1901.01.01", 
"X1901.02.01", "X1901.03.01", "X1901.04.01", "X1901.05.01", "X1901.06.01"
))

mod <- structure(c(71.5796750030741, 71.5925210418478, 70.8672045288309, 
67.9705857323206, 68.462614970737, 67.7095309202574), .Names = c("X1899.11.01", 
"X1899.12.01", "X1901.01.01", "X1901.02.01", "X1901.03.01", "X1901.04.01"
))

where X1901.01.01 corresponds to 1901-01-01 and so on. Please note that dates in observed and modeled data don't overlap completely.

This is just a sample - my real data contains thousands of observations.

What is the most efficient (i.e. fastest) way to combine these vectors in a data frame, assigning NA to non-matching dates and getting rid of the infamous "X" at the front of the original dates?

This would be the resulting data frame:

date            obs             mod
1899.11.01      NA              71.57968
1899.12.01      NA              71.59252
1901.01.01      68.72229        70.86720
1901.02.01      68.45651        67.97059
1901.03.01      68.32376        68.46261
1901.04.01      66.17897        67.70953
1901.05.01      63.71621        NA
1901.06.01      59.96985        NA

Solution

  • While @Alex A.'s answer works, this is date data, so it may be beneficial to treat it as such from the beginning. You can combine the two with the merge() function and all=TRUE, which keeps non-matching rows and fills them with NA. By default merge() joins on the column names the two data frames share (here, date):

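    # as.data.frame() keeps the vector names as row names
    obs <- as.data.frame(obs)
    mod <- as.data.frame(mod)
    # Parse the row names (e.g. "X1901.01.01") into Date objects, dropping the "X"
    obs[["date"]] <- as.Date(row.names(obs), "X%Y.%m.%d")
    mod[["date"]] <- as.Date(row.names(mod), "X%Y.%m.%d")

    # Full outer join on the shared "date" column; unmatched rows get NA
    d <- merge(obs, mod, all=TRUE)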

    Since the date column is of class Date, you could then easily convert the data.frame to an xts time series, or something else, for subsetting, summarizing, etc.
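Since the question asks about speed, here is a minimal base-R sketch of a different technique (not the merge() approach above): index both named vectors by the union of their names, relying on the fact that looking up a name a vector does not contain returns NA. The names keys, dates, ord, and combined are made up for illustration, and whether this is actually faster than merge() on thousands of observations is an assumption that would need benchmarking on the real data.

```r
# Sample vectors from the question
obs <- c(X1901.01.01 = 68.72228685137,   X1901.02.01 = 68.4565130874024,
         X1901.03.01 = 68.3237563140977, X1901.04.01 = 66.1789683147099,
         X1901.05.01 = 63.7162100107148, X1901.06.01 = 59.9698454002755)
mod <- c(X1899.11.01 = 71.5796750030741, X1899.12.01 = 71.5925210418478,
         X1901.01.01 = 70.8672045288309, X1901.02.01 = 67.9705857323206,
         X1901.03.01 = 68.462614970737,  X1901.04.01 = 67.7095309202574)

# Every date label that occurs in either vector
keys <- union(names(obs), names(mod))

# Parse the labels into Date objects (this also drops the leading "X")
# and work out the chronological order
dates <- as.Date(keys, format = "X%Y.%m.%d")
ord   <- order(dates)

# Indexing a named vector by a name it does not contain yields NA,
# which fills the gaps without an explicit merge
combined <- data.frame(
  date = dates[ord],
  obs  = unname(obs[keys[ord]]),
  mod  = unname(mod[keys[ord]])
)

print(combined)
```

This sketch assumes the names within each vector are unique; with duplicated dates, name-based indexing would silently pick the first match, whereas merge() would return every matching combination.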