
Convert timestamps in Google Finance stock data to proper datetime


I am trying to convert the timestamps in the stock data from the Google Finance API to a more usable datetime format.

I have used data.table::fread to read the data here:

fread(<url>)

      datetime    open     high      low    close  volume
1: a1497619800 154.230 154.2300 154.2300 154.2300     500
2:           1 153.720 154.3200 153.7000 154.2500 1085946
3:           2 153.510 153.8000 153.2000 153.7700   34882
4:           3 153.239 153.4800 153.1400 153.4800   24343
5:           4 153.250 153.3000 152.9676 153.2700   20212

As you can see, the "datetime" format is rather strange. The format is described in this link:

The full timestamps are denoted by the leading 'a'. Like this: a1092945600. The number after the 'a' is a Unix timestamp. [...]

The numbers without a leading 'a' are "intervals". So, for example, the second row in the data set below has an interval of 1. You can multiply this number by our interval size [...] and add it to the last Unix Timestamp.

In my case, the "interval size" is 300 seconds (5 minutes). The numbering restarts at the start of each new day, so converting it is quite difficult!

I can pull out the index positions of the day starts by using grep to search for "a":

newDay <- grep(df$V1, pattern = "a") 

My idea was then to split the data frame into chunks at those index positions, expand the Unix times within each day separately, and then combine them back into a data.table before storing.

split.data.table looks like it will do the job, but I am unsure how to supply it the day breaks so that it splits at those index positions, or whether there is a more logical way to achieve the same result without having to break the data down day by day.
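To illustrate, the split I have in mind could look something like this in base R (the vector below is made up for illustration; my real data has many more rows per day):

```r
# hypothetical sample of the raw first column
datetime <- c("a1497619800", "1", "2", "a1500000000", "3", "5")

# day id: increments at every row starting with "a"
day <- cumsum(grepl("^a", datetime))

# list of per-day chunks that could be processed separately
chunks <- split(datetime, day)
```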

Thanks.


Solution

  • You may use grepl to search for "a" in "datetime", which gives a logical vector. Take the cumsum of that vector to create a grouping variable: at each "a" (TRUE) the counter increases by one.

    Within each group, convert the first element to POSIXct, using an appropriate format and origin (and, if needed, a timezone tz). Then add multiples of the 'interval size' (300 sec): zero for the first element, and the "datetime" multiples for the others.

    d[ , time := {
      # first element of each group is the full stamp, e.g. "a1497619800"
      t1 <- as.POSIXct(datetime[1], format = "a%s", origin = "1970-01-01")
      # offset 0 for the "a" row; interval * 300 sec for the remaining rows
      t1 + c(0, as.numeric(datetime[-1]) * 300)
    }
    , by = .(cumsum(grepl("^a", datetime)))]
    
    d
    #       datetime                time
    # 1: a1497619800 2017-06-16 15:30:00
    # 2:           1 2017-06-16 15:35:00
    # 3:           2 2017-06-16 15:40:00
    # 4:           3 2017-06-16 15:45:00
    # 5:           4 2017-06-16 15:50:00
    # 6: a1500000000 2017-07-14 04:40:00
    # 7:           3 2017-07-14 04:55:00
    # 8:           5 2017-07-14 05:05:00
    # 9:           7 2017-07-14 05:15:00
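    If format = "a%s" does not work on your platform (the %s conversion is not available in every strptime implementation), you can strip the leading "a" yourself. A base-R sketch, with tz fixed to UTC so the printed time does not depend on your locale:

```r
# portable alternative: drop the leading "a", then treat the rest
# as seconds since the epoch
stamp <- "a1497619800"
secs  <- as.numeric(sub("^a", "", stamp))
t1    <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
format(t1, "%Y-%m-%d %H:%M:%S")   # "2017-06-16 13:30:00" in UTC
```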
    

    Some toy data:

    d <- fread(input = "datetime    
    a1497619800 
              1 
              2 
               3 
               4
    a1500000000
    3
    5
    7")
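    As a quick sanity check on the interval arithmetic, using the second "a" stamp from the toy data (base R, tz fixed to UTC):

```r
# full timestamp of the second day in the toy data
t1 <- as.POSIXct(1500000000, origin = "1970-01-01", tz = "UTC")
# intervals 3, 5, 7 -> multiples of the 300 s interval size
offsets <- c(0, 3, 5, 7) * 300
times <- t1 + offsets
format(times, "%H:%M")   # "02:40" "02:55" "03:05" "03:15" (UTC)
```

    (The output printed above shows 04:40 etc. because it was apparently produced in a UTC+2 timezone; the underlying instants are the same.)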