How to pad the datetime column of a data frame?


I have a data frame with an incomplete datetime variable. The data are recorded at 0.5-second intervals. I want to find which time points are missing and insert a record for each of them. For example, given the data below:

df <- structure(list(time = structure(c(1652182189, 1652182189, 1652182190, 1652182191
), class = c("POSIXct", "POSIXt"), tzone = ""), positionx = 3:6),
row.names = c(NA, -4L), class = "data.frame")

I tried to solve it with pad():

pad(df, interval = " .5 sec")

However, this fails with the error below, because the smallest interval unit that pad() accepts is the second.

Error: interval is not valid.

I want to pad the data frame as shown below. How can I solve this problem?

  time                positionx
  <dttm>                  <int>
1 2022-05-10 19:29:49         3
2 2022-05-10 19:29:49         4
3 2022-05-10 19:29:50         5
4 2022-05-10 19:29:50        NA
5 2022-05-10 19:29:51         6


Solution

    library(tidyverse)
    library(lubridate)
    
    tibble(time=seq(min(df$time), max(df$time), by=seconds(0.5))) %>% 
      left_join(df, by="time")
    # A tibble: 6 × 2
      time                positionx
      <dttm>                  <int>
    1 2022-05-10 12:29:49         3
    2 2022-05-10 12:29:49         4
    3 2022-05-10 12:29:49        NA
    4 2022-05-10 12:29:50         5
    5 2022-05-10 12:29:50        NA
    6 2022-05-10 12:29:51         6
    

    Edit: In response to OP's question in the comments. (And I accept that I should have spotted the difference between my output and OP's desired result before posting.)

    I believe the problem is not in my algorithm, but in OP's input data.

    Compare

    strftime(df$time,'%Y-%m-%d %H:%M:%OS3')
    [1] "2022-05-10 12:29:49.000" "2022-05-10 12:29:49.000" "2022-05-10 12:29:50.000" "2022-05-10 12:29:51.000"
    

    with

    strftime(seq(min(df$time), max(df$time), by=seconds(0.5)),'%Y-%m-%d %H:%M:%OS3')
    [1] "2022-05-10 12:29:49.000" "2022-05-10 12:29:49.500" "2022-05-10 12:29:50.000" "2022-05-10 12:29:50.500" "2022-05-10 12:29:51.000"
    

    In the input data, rows 1 and 2 represent the same time (12:29:49.000). The algorithm correctly adds a third row at 12:29:49.500. All three times look identical because the default format for displaying datetime objects in tibbles omits fractional seconds.
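
A quick way to confirm this kind of problem is to check the input for repeated timestamps before padding. The following is a minimal sketch in base R, not part of the original answer; the variable names `times`, `dup` and `gaps` are illustrative, and the time zone is fixed to UTC as an assumption so the printed strings do not depend on the local setting.

```r
# The four timestamps from the question; tz = "UTC" is an assumption made
# here so that the output does not depend on the local time zone.
times <- as.POSIXct(c(1652182189, 1652182189, 1652182190, 1652182191),
                    origin = "1970-01-01", tz = "UTC")

# Print with millisecond precision so sub-second differences become visible.
format(times, "%Y-%m-%d %H:%M:%OS3")

# Flag rows whose timestamp repeats an earlier row.
dup <- duplicated(times)
dup
# FALSE  TRUE FALSE FALSE

# Gaps between consecutive rows, in seconds; a gap of 0 is a repeated time.
gaps <- diff(as.numeric(times))
gaps
```

If `dup` contains any `TRUE`, the padded result will have more rows than the regular grid, which is what happened with the 6-row output above.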

    If OP's input data is instead,

    df <- structure(list(time = structure(c(1652182189.0, 1652182189.5, 1652182190.0, 1652182191.0
    ), class = c("POSIXct", "POSIXt"), tzone = ""), positionx = 3:6),
    row.names = c(NA, -4L), class = "data.frame")
    

    Then the output is

    tibble(time=seq(min(df$time), max(df$time), by=seconds(0.5))) %>% 
      left_join(df, by="time")
    # A tibble: 5 × 2
      time                positionx
      <dttm>                  <int>
    1 2022-05-10 12:29:49         3
    2 2022-05-10 12:29:49         4
    3 2022-05-10 12:29:50         5
    4 2022-05-10 12:29:50        NA
    5 2022-05-10 12:29:51         6
    

    As expected.
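
For readers who prefer to avoid the tidyverse dependency, the same idea (build the full 0.5-second grid, then left-join the data onto it) can probably be written in base R as well. This is a sketch under assumptions rather than the answer's own method: it swaps `left_join()` for `merge(..., all.x = TRUE)`, passes a plain numeric `by = 0.5` (interpreted as seconds by `seq()` on POSIXct values), fixes the time zone to UTC, and uses the illustrative names `grid`, `padded` and `missing_times`.

```r
# Corrected input from the answer above; tz = "UTC" is an assumption.
df <- data.frame(
  time = as.POSIXct(c(1652182189.0, 1652182189.5, 1652182190.0, 1652182191.0),
                    origin = "1970-01-01", tz = "UTC"),
  positionx = 3:6
)

# Full 0.5-second grid; a numeric `by` is interpreted as seconds.
grid <- data.frame(time = seq(min(df$time), max(df$time), by = 0.5))

# Left join: keep every grid time, filling unmatched rows with NA.
padded <- merge(grid, df, by = "time", all.x = TRUE)

# Sort by time explicitly so the row order does not rely on merge() defaults.
padded <- padded[order(padded$time), ]
padded

# Time points that were missing from the input.
missing_times <- padded$time[is.na(padded$positionx)]
format(missing_times, "%Y-%m-%d %H:%M:%OS3")
```

Both versions rely on exact equality of timestamps. That holds here because 0.5 is exactly representable in binary floating point; for a step such as 0.1 seconds, rounding both time columns to a common precision before joining is likely to be necessary.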