Search code examples
rdatetimesplitlubridateposixct

Splitting multiple date and time variables & computing time average in R


I have the following dataset wherein, I have the person's ID, district and sub-district they live in along with the last date/time on which they uploaded data to the server. The variables "last_down_" contain the last date/time on which a person the uploaded data and are named in such a way that they show the date on which I had downloaded the data on the same. For example, "last_upload_2020-06-12" would mean I downloaded the data from the server on 12th June.

For the below dataset, I would like to spilt the date and time in each of the variables (all at once) in a way that the new separated variables which are created go by the name "last_date_(my download date)" & "last_time_(my download date)"

 district block id  last_upload_2020-06-12 last_upload_2020-06-13 last_upload_2020-06-14 last_upload_2020-06-15
    A   X   11  2020-02-06 11:53:19.0   2020-02-06 11:53:19.0   2020-02-06 11:53:19.0   2020-02-06 11:53:19.0
    A   X   12  2020-06-11 12:40:26.0   2020-06-11 12:40:26.0   2020-06-14 11:40:26.0   2020-06-15 18:50:26.0
    A   X                                                       2020-06-14 11:08:12.0   2020-06-14 11:08:12.0
    A   X   14  2020-06-12 11:31:07.0   2020-06-13 11:31:07.0   2020-06-14 17:37:07.0   2020-06-14 17:37:07.0
    A   Y   15  2020-06-10 12:45:48.0   2020-06-10 12:45:48.0   2020-06-10 12:45:48.0   2020-06-10 12:45:48.0
    A   Y   16  2020-04-04 02:26:57.0   2020-04-04 02:26:57.0   2020-04-04 02:26:57.0   2020-04-04 02:26:57.0
    A   Y   17  2020-03-31 08:10:03.0   2020-03-31 08:10:03.0   2020-03-31 08:10:03.0   2020-03-31 08:10:03.0
    A   Y   18  2020-05-30 12:08:15.0   2020-05-30 12:08:15.0   2020-05-30 12:08:15.0   2020-05-30 12:08:15.0
    A   Z   19  2020-04-09 15:21:52.0   2020-04-09 15:21:52.0   2020-04-09 15:21:52.0   2020-04-09 15:21:52.0
    A   Z   20  2020-05-30 17:42:33.0   2020-05-30 17:42:33.0   2020-05-30 17:42:33.0   2020-05-30 17:42:33.0
    A   Z   21  2020-04-12 14:23:29.0   2020-04-12 14:23:29.0   2020-04-12 14:23:29.0   2020-04-12 14:23:29.0
    A   Z   22  2020-05-13 23:18:19.0   2020-05-13 23:18:19.0   2020-05-13 23:18:19.0   2020-05-13 23:18:19.0
    A   X   23  2020-04-30 09:53:31.0   2020-04-30 09:53:31.0   2020-04-30 09:53:31.0   2020-04-30 09:53:31.0
    A   X   24  2020-06-10 10:28:59.0   2020-06-10 10:28:59.0   2020-06-10 10:28:59.0   2020-06-15 11:31:33.0
    A   Y   25              
    A   Y   26  2020-05-30 12:14:09.0   2020-05-30 12:14:09.0   2020-05-30 12:14:09.0   2020-05-30 12:14:09.0
    B   E   31              
    B   C   32  2020-06-12 16:43:23.0   2020-06-12 16:43:23.0   2020-06-12 16:43:23.0   2020-06-12 16:43:23.0
    B   C   33  2019-10-24 22:30:35.0   2019-10-24 22:30:35.0   2019-10-24 22:30:35.0   2019-10-24 22:30:35.0
    B   C   34  2020-06-09 15:38:18.0   2020-06-09 15:38:18.0   2020-06-09 15:38:18.0   2020-06-15 14:35:41.0
    B   C   35  2020-06-11 14:39:51.0   2020-06-11 14:39:51.0   2020-06-11 14:39:51.0   2020-06-11 14:39:51.0
    B   D   36  2020-06-12 11:53:15.0   2020-06-12 11:53:15.0   2020-06-12 11:53:15.0   2020-06-15 13:02:39.0
    B   D   37  2020-04-21 15:43:43.0   2020-04-21 15:43:43.0   2020-04-21 15:43:43.0   2020-04-21 15:43:43.0
    B   D   38  2020-05-13 04:07:17.0   2020-05-13 04:07:17.0   2020-05-13 04:07:17.0   2020-05-13 04:07:17.0
    B   E   39  2020-04-30 13:51:20.0   2020-04-30 13:51:20.0   2020-04-30 13:51:20.0   2020-04-30 13:51:20.0
    B   E   40  2020-05-12 16:51:01.0   2020-05-12 16:51:01.0   2020-05-12 16:51:01.0   2020-05-12 16:51:01.0
    B   E   41  2020-04-16 12:14:24.0   2020-04-16 12:14:24.0   2020-04-16 12:14:24.0   2020-04-16 12:14:24.0
    B   C   42  2018-06-07 15:12:18.0   2018-06-07 15:12:18.0   2018-06-07 15:12:18.0   2018-06-07 15:12:18.0
    B   D   43  2019-09-28 10:08:51.0   2019-09-28 10:08:51.0   2019-09-28 10:08:51.0   2019-09-28 10:08:51.0

N.B: my date/time variables are numeric.

Once I get the data in shape, I would also like to do the following:

  1. Get the year and month of all observations under "last_upload_2020-06-12" in a separate column.

  2. Similarly, for the last date in my dataset that is "last_upload_2020-06-15". Can I automate R picking the last date something like Sys.Date()-1? I will always have the data for one date less than current.

  3. Calculate the average upload time per ID, i.e., generally around what time does a person upload data to the server? Average should be based on unique time values.

Would be extremely helpful if someone could help solve this!

Thanks, Rachita


Solution

  • The Df looked so complicated that I thought it might be better to replicate it. I then used a function to take every column you wanted and separate it into the last_date and last_time as wanted. Inside the function the temporary DF is cbind to a DF built outside of the loop. This DF consisted out of the columns which are not treated in the loop. The result of this loop is the DF as wanted. [colnames got a little long]

    The key for the second task was to transfer to last_time to hours, then grouping und summarizing.

    I hope this is what you wanted.

    I think with this as a basis you can deal with no2.

    There were some warnings which had to do with NA's.

    More explanation in the reprex below.

    library(tidyverse)
    
    
    df <- read.table(text = '
    district block id  last_upload_2020_06_12 last_upload_2020_06_13 last_upload_2020_06_14 last_upload_2020_06_15
    "A"   "X"   11  "2020-02-06 11:53:19.0"   "2020-02-06 11:53:19.0"   "2020-02-06 11:53:19.0"   "2020-02-06 11:53:19.0"
    "A"   "X"   12  "2020-06-11 12:40:26.0"   "2020-06-11 12:40:26.0"   "2020-06-14 11:40:26.0"   "2020-06-15 18:50:26.0"
    "A"   "X"   NA  "NA"                      "NA"                      "2020-06-14 11:0812.0"    "2020-06-14 11:0812.0"
    "A"   "X"   14  "2020-06-12 11:31:07.0"   "2020-06-13 11:31:07.0"   "2020-06-14 17:37:07.0"   "2020-06-14 17:37:07.0"
    "A"   "Y"   15  "2020-06-10 12:45:48.0"   "2020-06-10 12:45:48.0"   "2020-06-10 12:45:48.0"   "2020-06-10 12:45:48.0"
    "A"   "Y"   16  "2020-04-04 02:26:57.0"   "2020-04-04 02:26:57.0"   "2020-04-04 02:26:57.0"   "2020-04-04 02:26:57.0"
    "A"   "Y"   17  "2020-03-31 08:10:03.0"   "2020-03-31 08:10:03.0"   "2020-03-31 08:10:03.0"   "2020-03-31 08:10:03.0"
    "A"   "Y"   18  "2020-05-30 12:08:15.0"   "2020-05-30 12:08:15.0"   "2020-05-30 12:08:15.0"   "2020-05-30 12:08:15.0"
    "A"   "Z"   19  "2020-04-09 15:21:52.0"   "2020-04-09 15:21:52.0"   "2020-04-09 15:21:52.0"   "2020-04-09 15:21:52.0"
    "A"   "Z"   20  "2020-05-30 17:42:33.0"   "2020-05-30 17:42:33.0"   "2020-05-30 17:42:33.0"   "2020-05-30 17:42:33.0"
    "A"   "Z"   21  "2020-04-12 14:23:29.0"   "2020-04-12 14:23:29.0"   "2020-04-12 14:23:29.0"   "2020-04-12 14:23:29.0"
    "A"   "Z"   22  "2020-05-13 23:18:19.0"   "2020-05-13 23:18:19.0"   "2020-05-13 23:18:19.0"   "2020-05-13 23:18:19.0"
    "A"   "X"   23  "2020-04-30 09:53:31.0"   "2020-04-30 09:53:31.0"   "2020-04-30 09:53:31.0"   "2020-04-30 09:53:31.0"
    "A"   "X"   24  "2020-06-10 10:28:59.0"   "2020-06-10 10:28:59.0"   "2020-06-10 10:28:59.0"   "2020-06-15 11:31:33.0"
    "A"   "Y"   25  " "                        ""                     ""                         ""
    "A"   "Y"   26  "2020-05-3012:14:09.0"   "2020-05-30 12:14:09.0"   "2020-05-30 12:14:09.0"   "2020-05-30 12:14:09.0"
    "B"   "E"   31  ""            ""                      ""                        ""          ""
    "B"   "C"   32  "2020-06-1 16:43:23.0"   "2020-06-12 16:43:23.0"   "2020-06-12 16:43:23.0"   "2020-06-12 16:43:23.0"
    "B"   "C"   33  "2019-10-24 22:30:35.0"   "2019-10-24 22:30:35.0"   "2019-10-24 22:30:35.0"   "2019-10-24 22:30:35.0"
    "B"   "C"   34  "2020-06-09 15:38:18.0"   "2020-06-09 15:38:18.0"   "2020-06-09 15:38:18.0"   "2020-06-15 14:35:41.0"
    "B"   "C"   35  "2020-06-11 14:39:51.0"   "2020-06-11 14:39:51.0"   "2020-06-11 14:39:51.0"   "2020-06-11 14:39:51.0"
    "B"   "D"   36  "2020-06-12 11:53:15.0"   "2020-06-12 11:53:15.0"   "2020-06-12 11:53:15.0"   "2020-06-15 13:02:39.0"
    "B"   "D"   37  "2020-04-21 15:43:43.0"   "2020-04-21 15:43:43.0"   "2020-04-21 15:43:43.0"   "2020-04-21 15:43:43.0"
    "B"   "D"   38  "2020-05-13 04:07:17.0"   "2020-05-13 04:07:17.0"   "2020-05-13 04:07:17.0"   "2020-05-13 04:07:17.0"
    "B"   "E"   39  "2020-04-30 13:51:20.0"   "2020-04-30 13:51:20.0"   "2020-04-30 13:51:20.0"   "2020-04-30 13:51:20.0"
    "B"   "E"   40  "2020-05-12 16:51:01.0"   "2020-05-12 16:51:01.0"   "2020-05-12 16:51:01.0"   "2020-05-12 16:51:01.0"
    "B"   "E"   41  "2020-04-16 12:14:24.0"   "2020-04-16 12:14:24.0"   "2020-04-16 12:14:24.0"   "2020-04-16 12:14:24.0"
    "B"   "C"   42  "2018-06-07 15:12:18.0"   "2018-06-07 15:12:18.0"   "2018-06-07 15:12:18.0"   "2018-06-07 15:12:18.0"
    "B"   "D"   43  "2019-09-28 10:08:51.0"   "2019-09-28 10:08:51.0"   "2019-09-28 10:08:51.0"   "2019-09-28 10:08:51.0"
               ', header =T)
    
    
    # TASK: create for each column which contains 'last_upload' new columns 
    # with date and time
    
    # get the colnames of the cols to be split or separated
    ccl <- colnames(df %>% select(last_upload_2020_06_12:last_upload_2020_06_15))
    
    # create new DF with first 3 columns, to which other columns are bound in
    # the following function
    dff <- df %>% select(district:id) 
    
    # function to separate each col in ccl to _date and _time
    for (cl in ccl) {
      tmp <- separate(df,
        col = cl, sep = " ",
        into = c(paste0(cl, "_date"), paste0(cl, "_time"))
      ) %>%
        select(contains("_date") | contains("_time"))
      dff <- cbind(dff, tmp)
    }
    
    
    dff %>% head()
    #>   district block id last_upload_2020_06_12_date last_upload_2020_06_12_time
    #> 1        A     X 11                  2020-02-06                  11:53:19.0
    #> 2        A     X 12                  2020-06-11                  12:40:26.0
    #> 3        A     X NA                        <NA>                        <NA>
    #> 4        A     X 14                  2020-06-12                  11:31:07.0
    #> 5        A     Y 15                  2020-06-10                  12:45:48.0
    #> 6        A     Y 16                  2020-04-04                  02:26:57.0
    #>   last_upload_2020_06_13_date last_upload_2020_06_13_time
    #> 1                  2020-02-06                  11:53:19.0
    #> 2                  2020-06-11                  12:40:26.0
    #> 3                        <NA>                        <NA>
    #> 4                  2020-06-13                  11:31:07.0
    #> 5                  2020-06-10                  12:45:48.0
    #> 6                  2020-04-04                  02:26:57.0
    #>   last_upload_2020_06_14_date last_upload_2020_06_14_time
    #> 1                  2020-02-06                  11:53:19.0
    #> 2                  2020-06-14                  11:40:26.0
    #> 3                  2020-06-14                   11:0812.0
    #> 4                  2020-06-14                  17:37:07.0
    #> 5                  2020-06-10                  12:45:48.0
    #> 6                  2020-04-04                  02:26:57.0
    #>   last_upload_2020_06_15_date last_upload_2020_06_15_time
    #> 1                  2020-02-06                  11:53:19.0
    #> 2                  2020-06-15                  18:50:26.0
    #> 3                  2020-06-14                   11:0812.0
    #> 4                  2020-06-14                  17:37:07.0
    #> 5                  2020-06-10                  12:45:48.0
    #> 6                  2020-04-04                  02:26:57.0
    
    # TASK: Calculate the average time of a day each id does a download
    # new DF from original brought into long format
    # split the date/time into last_date and last_time
    ddf <- df %>%
      pivot_longer(cols = last_upload_2020_06_12:last_upload_2020_06_15) %>%
      separate(col = value, sep = ' ', into = c('last_date', 'last_time')) %>%
      mutate(last_date = lubridate::ymd(last_date), last_time= lubridate::hms(last_time))
    
    
    # calculating the mean hour of the day at which each id does a 
    # download, by calculating last_time to hours (of the day) and
    # after grouping build mean hour
    ddf %>% 
      mutate(hours = as.numeric(lubridate::hms(last_time), unit = 'hour')) %>%
      group_by(id) %>% summarise(meanHourOfTheDay = mean(hours, na.rm = T))
    
    #> # A tibble: 29 x 2
    #>       id meanHourOfTheDay
    #>    <int>            <dbl>
    #>  1    11            11.9 
    #>  2    12            14.0 
    #>  3    14            14.6 
    #>  4    15            12.8 
    #>  5    16             2.45
    #>  6    17             8.17
    #>  7    18            12.1 
    #>  8    19            15.4 
    #>  9    20            17.7 
    #> 10    21            14.4 
    #> # … with 19 more rows