Search code examples
datedatetimetime-seriesdata-analysis

Clarification on Interpreting Hourly Data: Does Hour 1 Represent 00:00-01:00 or 01:00-02:00?


I'm working with a dataset that contains hourly data, and I'm having trouble interpreting how the hours are represented. The dataset has two columns: one for the date and one for the hour. Here's an example of how the data is structured:

Date Hour
2023-01-01 1
2023-01-01 2
2023-01-01 3
... ...
2023-01-01 24

I need to understand whether "2023-01-01 hour 1" refers to:

  • The period from 00:00 to 01:00 (end of the hour convention), or
  • The period from 01:00 to 02:00 (start of the hour convention). I've seen both conventions used in different contexts, and I'm not sure which one applies here. This distinction is crucial for my analysis. This confuses me a bit. Every hour shift up or down could possibly cause wrong outcomes

I haven't found the answer to this question anywhere, so I would like to start a discussion and find out what your interpretation and good practices are in this matter

Additional Information:
  • The dataset does not come with any additional metadata or documentation.
  • It is commonly used in energy data analysis, but I'm unsure if that implies a specific convention.
What I've Tried:
  • Reviewing other datasets and documentation within the same domain, but found mixed conventions.
  • Looking for patterns or descriptions in the dataset itself, but haven't found conclusive evidence.
My Questions:
  • Is there a standard convention for representing hourly data in such tables?
  • How can I determine which convention is being used if the documentation is unclear or unavailable? Are there any common practices or hints I should look for in the dataset that might help clarify this?

Any insights or suggestions on how to approach this issue would be greatly appreciated!


Solution

  • Standard

    Is there a standard convention for representing hourly data in such tables?

    The idea that a value equals to itself is kinda standard. So if there is no contradictory evidence, then the null hypothesis is that 1 means 1:00, 2 means 2:00, ... 24 means 00:00

    H0 or the null hypothesis is the hypothesis you are to make if you have to choose a hypothesis and you have no evidence that would debunk any of the possibilities, nor arguments that would change their likelihood.

    So, keep in mind that we need to assume that 24 is by default meaning 00:00 unless something suggests otherwise. But you will need to investigate (and you already did it to some extent) in order to be sure.

    Documentation

    As you also suggested, documentation is a great place to find any evidence that would shift your position towards a direction. You also stated that the documentation did not say what the 24 means. But you can look for secondary evidence. For example, if your documentation says that in order to sort the times you need to run something like

    select `Date`, `Hour`
    from t
    order by `Date`, `Hour`
    

    and you have Hour values of 24, then, since 24 > 23, it will become clear that 24 is after 23, so 24 must mean the 24th hour, that is, 23:00 and consequently each hour means one less than itself.

    Note that I'm not saying that this is the actual case, I'm just saying that if the hours are numerically sorted to sort by time, then 24 means 00:00.

    Code

    Look into your application code, look how the hours are ordered, look at what comments are there, whether the code suggests that an hour is at the start of the day. Also look into the way data is generated. If you have something like:

    insert into t(`Date`, `Hour`)
    values
    ('2024-08-08', 1),
    ('2024-08-08', 2),
    ('2024-08-08', 3),
    ('2024-08-08', 4),
    ('2024-08-08', 5),
    ('2024-08-08', 6),
    ('2024-08-08', 7),
    ('2024-08-08', 8),
    ('2024-08-08', 9),
    ('2024-08-08', 10),
    ('2024-08-08', 11),
    ('2024-08-08', 12),
    ('2024-08-08', 13),
    ('2024-08-08', 14),
    ('2024-08-08', 15),
    ('2024-08-08', 16),
    ('2024-08-08', 17),
    ('2024-08-08', 18),
    ('2024-08-08', 19),
    ('2024-08-08', 20),
    ('2024-08-08', 21),
    ('2024-08-08', 22),
    ('2024-08-08', 23),
    ('2024-08-08', 24);
    

    then the 24 most likely means 23:00. However, if you have:

    insert into t(`Date`, `Hour`)
    values
    ('2024-08-08', 24),
    ('2024-08-08', 1),
    ('2024-08-08', 2),
    ('2024-08-08', 3),
    ('2024-08-08', 4),
    ('2024-08-08', 5),
    ('2024-08-08', 6),
    ('2024-08-08', 7),
    ('2024-08-08', 8),
    ('2024-08-08', 9),
    ('2024-08-08', 10),
    ('2024-08-08', 11),
    ('2024-08-08', 12),
    ('2024-08-08', 13),
    ('2024-08-08', 14),
    ('2024-08-08', 15),
    ('2024-08-08', 16),
    ('2024-08-08', 17),
    ('2024-08-08', 18),
    ('2024-08-08', 19),
    ('2024-08-08', 20),
    ('2024-08-08', 21),
    ('2024-08-08', 22),
    ('2024-08-08', 23);
    

    then 24 most likely means 00:00.

    Colleagues

    Ask around people who might know

    Formulas

    If you have formulas that convert timezones, then you can often infer the meaning of the input from the formula.

    Summary

    If no further information, choose H0, which is that 1 = 1 is true. But look for contradictory evidence, try to debunk this hypothesis. If you fail to debunk it, then it is the best bet. If you manage to debunk it, then choose the better hypothesis instead.