r transformation normalization data-preprocessing

Normalize time stamp data

I have a large data set in which times are stored as a numeric data type, in 24-hour HHMM form.

Since the data type is numeric, the leading zeros are absent. A sample of the data can be found here:

> dput(sleepDiary_1[1:100,3:4])

structure(list(`What time did you get into bed? (hhmm) (e.g., 11pm = 2300, 1.35am = 0135)
*please make sure its 4 digits (2 for hours, 2 for minutes)` = c(2330, 
100, 9, 110, 10, 0, 209, 330, 2330, 50, 330, 800, 30, 100, 0, 
2345, 130, 135, 400, 330, 100, 400, 100, 100, 315, 2305, 250, 
215, 2300, 3, 356, 2306, 500, 0, 200, 10, 230, 2230, 100, 2200, 
1230, 1128, 100, 430, 200, 5, 300, 145, 1, 100, 2330, 300, 2314, 
1130, 0, 30, 1230, 15, 2300, 300, 200, 315, 2300, 105, 2310, 
300, 1248, 30, 30, 2315, 2300, 35, 2300, 211, 1330, 115, 45, 
130, 1200, 200, 300, 1220, 200, 230, 100, 300, 300, 145, 1100, 
544, 300, 300, 2238, 0, 100, 133, 30, 5, 205, 300), `What time did you try and go to sleep? (hhmm) (e.g., 11pm = 2300, 1.35am = 0135)` = c(2330, 
115, 34, 130, 20, 0, 257, 330, 15, 110, 430, 800, 40, 130, 0, 
2345, 200, 150, 445, 330, 105, 400, 100, 100, 315, 2305, 330, 
220, 100, 3, 430, 2306, 500, 5, 200, 0, 400, 2240, 130, 2200, 
1230, 200, 130, 430, 215, 15, 320, 200, 30, 130, 2330, 300, 2314, 
1132, 15, 30, 1230, 40, 2345, 300, 200, 315, 200, 110, 2310, 
300, 1248, 125, 30, 2310, 0, 20, 0, 211, 1345, 45, 0, 155, 100, 
330, 400, 1230, 200, 300, 115, 300, 300, 200, 1152, 530, 330, 
300, 2230, 45, 130, 130, 25, 20, 230, 320)), row.names = c(NA, 
-100L), class = c("tbl_df", "tbl", "data.frame"))

I wish to normalise the columns so I can perform further analysis, but I'm not sure which normalisation will work best. I looked at the various options for non-normal data, but none of them covers cyclic data that wraps around after a certain period, i.e., after 2400 the time goes back to 0000, so the values don't keep increasing but cycle.

To add, the data are the sleep and wake-up times of different participants recorded in a study. We wish to normalise the data and remove any outliers that may be present.

Cheers!

Solution

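I think this gets you closer to what you want. I started by renaming the columns (df here is assumed to be the sample from the question).

library(ggplot2)
df <- sleepDiary_1[1:100, 3:4]  # the sample shown in the question
names(df) <- c("bed_try", "sleep_try")
ggplot(df, aes(bed_try, sleep_try)) + geom_point()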

To convert from hhmm to decimal hours, where the part after the decimal point is a fraction of an hour:

convert_hhmm <- function(hhmm) {
  floor(hhmm / 100) +
    (hhmm - floor(hhmm / 100) * 100) / 60
}
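As a quick check of the conversion, here is a minimal sketch that does the same arithmetic with integer division and modulo, plus a validity check for the raw entries. The names hhmm_to_hours and is_valid_hhmm are hypothetical helpers, not part of the original answer, and the test values are made up for illustration.

```r
# Same arithmetic as convert_hhmm(), written with %/% and %%.
# (hhmm_to_hours is a hypothetical name used only for this sketch.)
hhmm_to_hours <- function(hhmm) {
  hours   <- hhmm %/% 100   # e.g. 2330 -> 23
  minutes <- hhmm %% 100    # e.g. 2330 -> 30
  hours + minutes / 60
}

# Flag entries that cannot be a 24-hour HHMM time (hypothetical helper).
is_valid_hhmm <- function(hhmm) {
  hours   <- hhmm %/% 100
  minutes <- hhmm %% 100
  !is.na(hhmm) & hours >= 0 & hours <= 23 & minutes >= 0 & minutes <= 59
}

hhmm_to_hours(c(0, 30, 135, 2330))   # 0, 0.5, about 1.583, 23.5
is_valid_hhmm(c(2330, 2460, 175, 9)) # TRUE, FALSE, FALSE, TRUE
```

Running is_valid_hhmm on both columns before converting would show whether any entries (for example, minutes above 59) need attention first.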

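Pick an arbitrary start to the sleep period; 2000 looks good. Change all times to hhmm after this "pivot time". Since we want hours after the pivot time, we can subtract it from times later than the pivot and add 2400 - pivot time to the rest. Doing this arithmetic directly on hhmm values only works because the pivot is a whole hour.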

pivot_time <- 2000

Convert bed_try to new column, bed_plus

df$bed_plus <- df$bed_try - pivot_time
df$bed_plus[df$bed_plus < 0] <- df$bed_plus[df$bed_plus < 0] + 
                                 pivot_time + # back to bed_try
                                 (2400 - pivot_time)
df$bed_plus <- convert_hhmm(df$bed_plus)

Convert sleep_try to new column, sleep_plus

df$sleep_plus <- df$sleep_try - pivot_time
df$sleep_plus[df$sleep_plus < 0] <- df$sleep_plus[df$sleep_plus < 0] + 
  pivot_time +
  (2400 - pivot_time)
df$sleep_plus <- convert_hhmm(df$sleep_plus)
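The two conversions above repeat the same steps and do their arithmetic on hhmm values. A minimal sketch of an alternative is to convert to decimal hours first and let modulo 24 handle the wrap-around, which also works when the pivot is not a whole hour. The name hours_after_pivot is a hypothetical helper, not from the original answer.

```r
# Hours elapsed since the pivot time, wrapping around midnight.
# (hours_after_pivot is a hypothetical helper for this sketch.)
hours_after_pivot <- function(hhmm, pivot_hhmm = 2000) {
  to_hours <- function(x) x %/% 100 + (x %% 100) / 60
  # %% in R returns a non-negative result for a positive divisor,
  # so times before the pivot wrap into the next day.
  (to_hours(hhmm) - to_hours(pivot_hhmm)) %% 24
}

hours_after_pivot(c(2330, 100, 1230, 2000))        # 3.5, 5, 16.5, 0
hours_after_pivot(c(2045, 15), pivot_hhmm = 2030)  # 0.25, 3.75
```

With this helper, the new columns could be built as df$bed_plus <- hours_after_pivot(df$bed_try) and df$sleep_plus <- hours_after_pivot(df$sleep_try).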

Exploratory plot

ggplot(df, aes(bed_plus, sleep_plus)) + geom_jitter()

ggplot(df, aes(sleep_plus - bed_plus)) + geom_histogram()

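Remove negatives, i.e. rows where the sleep attempt comes before getting into bed. Alternatively, figure out how to correct them.

# Logical indexing is used instead of -which(...), which would drop every row if there were no negatives
df <- df[(df$sleep_plus - df$bed_plus) >= 0, ]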
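If you would rather inspect the suspicious rows than drop them, one option is to take the difference modulo 24 and flag implausibly long gaps. This is only a sketch: flag_latency is a hypothetical helper, the 12-hour cutoff is an arbitrary assumption, and the toy values are computed by hand from two rows of the sample data.

```r
# Time between getting into bed and trying to sleep, assuming the sleep
# attempt follows bedtime within 24 hours. (Hypothetical helper; the
# 12-hour threshold is an arbitrary assumption for this sketch.)
flag_latency <- function(bed_plus, sleep_plus, max_latency_hours = 12) {
  latency <- (sleep_plus - bed_plus) %% 24
  data.frame(latency = latency, implausible = latency > max_latency_hours)
}

# Toy input in hours after a 2000 pivot:
# row 1: bed 2330 -> 3.5,  sleep 0015 -> 4.25
# row 2: bed 2315 -> 3.25, sleep 2310 -> 3 + 10/60
flag_latency(bed_plus = c(3.5, 3.25), sleep_plus = c(4.25, 3 + 10/60))
```

The second row comes out at nearly 24 hours and is flagged, which suggests a data-entry problem rather than a real delay. Whether to fix or discard such rows is a judgment call for the study.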

Results look more like what you want?

ggplot(df, aes(bed_plus, sleep_plus)) + geom_jitter()
ggplot(df, aes(sleep_plus - bed_plus)) + geom_histogram()