Search code examples
rdataframeposixct

how to fill the gaps with values present in each column in a dataframe in r?


This is how my data looks like:

dput(head(COR_trial,10))
structure(list(rDate = structure(c(1439995500, 1439995800, 1439996100, 
1439996400, 1439996700, 1439997000, 1439997300, 1439997600, 1439997900, 
1439998200), class = c("POSIXct", "POSIXt"), tzone = ""), CCRN630 = c(NA, 
NA, NA, NA, NA, 0.2878412, NA, NA, NA, NA), CCRN800 = c(NA, NA, 
NA, NA, 0.3213675, NA, NA, NA, NA, NA), CCRN532 = c(NA, NA, NA, 
0.3327465, NA, NA, NA, NA, NA, NA), CCRN570 = c(NA, NA, NA, NA, 
NA, 0.4172932, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")

And this is the head of the dataframe:

                rDate   CCRN630   CCRN800   CCRN532   CCRN570
1 2015-08-19 14:45:00        NA        NA        NA        NA
2 2015-08-19 14:50:00        NA        NA        NA        NA
3 2015-08-19 14:55:00        NA        NA        NA        NA
4 2015-08-19 15:00:00        NA        NA 0.3327465        NA
5 2015-08-19 15:05:00        NA 0.3213675        NA        NA
6 2015-08-19 15:10:00 0.2878412        NA        NA 0.4172932

From 2015-08-19 14:45:00 to 2018-10-11 13:00:00 (data distributed 5 in 5 minutes), I have 14 values for each one of the columns REFN630, REFN800, REFN532 and REFN570 (Note: their values occur in separate date and time, however sometimes they match)

I will give an example of one column REFN630 and the dates were the values are distributed.

  rDate               CCRN630
  <dttm>                <dbl>
1 2015-08-19 15:10:00   0.288
2 2015-10-23 10:40:00   0.129
3 2016-02-03 12:40:00   0.373
4 2016-03-24 13:25:00   0.392
5 2016-06-21 11:50:00   0.144
6 2016-07-15 11:35:00   0.195
7 2016-08-18 11:35:00   0.204
8 2016-12-20 13:00:00   0.22
9 2017-01-18 13:25:00   0.210
10 2017-02-17 13:05:00   0.237
11 2017-03-29 12:10:00   0.2  
12 2017-05-03 10:30:00   0.174
13 2017-06-08 12:20:00   0.157
14 2017-07-11 11:55:00   0.164

As you can see a LOT of empty spaces need to be filled between the dates. There's any way to fill the empty spaces with the same value until the next defined value?

I found a way to interpolate the values with this code:

CCRN630<-fillGap(COR_trial$CCRN630, method=c("fixed"),rule=2)

Now I want to do something more simple than the interpolation but I don't know how to do it.

The desired output would be something like this:

                rDate   CCRN630   CCRN800   CCRN532   CCRN570
1       2015-08-19 14:45:00 0.2878412 0.3213675 0.3327465 0.4172932
2       2015-08-19 14:50:00 0.2878412 0.3213675 0.3327465 0.4172932
3       2015-08-19 14:55:00 0.2878412 0.3213675 0.3327465 0.4172932
4       2015-08-19 15:00:00 0.2878412 0.3213675 0.3327465 0.4172932
5       2015-08-19 15:05:00 0.2878412 0.3213675 0.3327465 0.4172932
6       2015-08-19 15:10:00 0.2878412 0.3213675 0.3327465 0.4172932

18670   2015-08-19 14:45:00 0.2878412 0.3213675 0.3327465 0.4172932
18671   2015-08-19 14:50:00 0.2878412 0.3213675 0.3327465 0.4172932
18672   2015-10-23 10:40:00 0.1287671 0.1181319 0.2111437 0.2463768
18673   2015-08-19 15:00:00 0.1287671 0.1181319 0.2111437 0.2463768
18674   2015-08-19 15:05:00 0.1287671 0.1181319 0.2111437 0.2463768
18675   2015-08-19 15:10:00 0.1287671 0.1181319 0.2111437 0.2463768

Any help will be much appreciated.


Solution

  • If I understand your question the right way, you could use dplyr and tidyr:

    library(dplyr)
    library(tidyr)
    
    COR_trial %>%
      complete(rDate = seq(min(rDate), max(rDate), by=300)) %>%
      fill(starts_with("CCRN"))
    
    • complete creates the missing Dates and Times
    • seq(min(rDate), max(rDate), by=300) creates a sequence of dates and times starting at the lowest date/time in your dataset. The steps are always 5 minutes, counted in seconds, therefore by = 300.
    • fill uses the known values und fills the rows until the next known value. If you want the rows filled upwards, you can change fill(starts_with("CCRN")) with fill(starts_with("CCRN"), .direction="up").

    This returns

    # A tibble: 18,648 x 5
       rDate               CCRN630 CCRN800 CCRN532 CCRN570
       <dttm>                <dbl>   <dbl>   <dbl>   <dbl>
     1 2015-08-19 16:45:00  NA      NA      NA      NA    
     2 2015-08-19 16:50:00  NA      NA      NA      NA    
     3 2015-08-19 16:55:00  NA      NA      NA      NA    
     4 2015-08-19 17:00:00  NA      NA       0.333  NA    
     5 2015-08-19 17:05:00  NA       0.321   0.333  NA    
     6 2015-08-19 17:10:00   0.288   0.321   0.333   0.417
     7 2015-08-19 17:15:00   0.288   0.321   0.333   0.417
     8 2015-08-19 17:20:00   0.288   0.321   0.333   0.417
     9 2015-08-19 17:25:00   0.288   0.321   0.333   0.417
    10 2015-08-19 17:30:00   0.288   0.321   0.333   0.417
    

    Data

    structure(list(rDate = structure(c(1445589600, 1439995500, 1439995800, 
    1439996100, 1439996400, 1439996700, 1439997000, 1439997300, 1439997600, 
    1439997900, 1439998200), tzone = "", class = c("POSIXct", "POSIXt"
    )), CCRN630 = c(0.129, NA, NA, NA, NA, NA, 0.2878412, NA, NA, 
    NA, NA), CCRN800 = c(NA, NA, NA, NA, NA, 0.3213675, NA, NA, NA, 
    NA, NA), CCRN532 = c(NA, NA, NA, NA, 0.3327465, NA, NA, NA, NA, 
    NA, NA), CCRN570 = c(NA, NA, NA, NA, NA, NA, 0.4172932, NA, NA, 
    NA, NA)), row.names = c(NA, -11L), class = c("tbl_df", "tbl", 
    "data.frame"), problems = structure(list(row = 11L, col = NA_character_, 
        expected = "4 columns", actual = "5 columns", file = "literal data"), row.names = c(NA, 
    -1L), class = c("tbl_df", "tbl", "data.frame")))