Search code examples
data-warehousedststar-schema

What to do when daylight savings results in duplicate data rows?


I have a fact table for energy consumption as follows:

f_meter_data: 

utc_calendar_id
local_calendar_id
meter_id
reading
timestamp

The calendar table is structured as per the Kimball recommendations, and it's the recommendations in the Data Warehouse Toolkit that are why I have the two calendar IDs so users can query on local and UTC time.

This is all well and good but the problems arise when daylight savings kicks in.

As the granularity is half hour periods, there will be a duplicate fact records when the clocks change.

And when the clocks change in the other direction there will be a gap in the data.

How can I handle this situation?

Should I average the duplicate values and store that instead?

And for when it's a gap in the data, should I use an average of the point immediately before and the point immediately after the gap?


Solution

  • I have a feeling this question may end up getting closed as "primarily opinion based", but my particular opinion is that the system should be set up to deal with the fact that not every day has exactly 24 hours. There may be 23, 24 or 25. (Or, if you're on Lord Howe Island, 23.5, 24 or 24.5).

    Depending on when your extra hour falls (which will be different for each time zone), you may have something like:

    00 01a 01b 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
    

    Or you might consider coupling the hour with the local UTC offset, like:

    00-04:00 01-04:00 01-05:00 02-05:00 03-05:00  etc... 
    

    Or if you're doing half-hour buckets:

    00:00-04:00  00:30-04:00  01:00-04:00  01:30-04:00  01:00-05:00  01:30-05:00 ...
    

    It probably wouldn't be appropriate to do any averaging to align to 24 hours. If you did, then totals would be off.

    You also should consider how people will be using the data. Will they be trying to figure out trends across a given hour of the day? If so, then how will they compensate for a spike or dip caused by the DST transition? It may be as simple as putting an asterisk and footnote on the output report. Or it may be much more involved than that, depending on the usage.

    Also, you said you're working with 30-minute intervals. Be aware that there are some time zones that are 45-minute offset (Nepal, Chatham Islands, and a small region in Australia). So if you're trying to cover the entire world then you would need 15-minute interval buckets.

    And, as Whichert pointed out in comments, if you're using UTC then there is no daylight saving time. It's only when you group by local-time that you'll have this concern.

    You may also find the graphs in the DST tag wiki useful.