Search code examples
cdateformulaepoch

Is the mktime64 implementation simple by pure coincidence or by design?


Background

Note: I previously posted this on Math Exchange (link), but it didn't meet the guidelines.

I came across the Linux Kernel's implementation for converting a Gregorian calendar date to a Unix epoch (seconds elapsed since 1970/01/01 00:00:00 UTC). I was surprised by how few lines of code were required to implement this function, so naturally I wanted to understand how and why this equation works. While most of the terms have an obvious meaning, I wanted to know:

  • Where do the magic numbers 367 and 719499 come from?
  • How and why are the month and day arguments 1-indexed? I.e. 2012 March 3rd -> (2012, 3, 3)

Code snippet below (source permalink):

time64_t mktime64(const unsigned int year0, const unsigned int mon0,
        const unsigned int day, const unsigned int hour,
        const unsigned int min, const unsigned int sec)
{
    unsigned int mon = mon0, year = year0;

    /* 1..12 -> 11,12,1..10 */
    if (0 >= (int) (mon -= 2)) {
        mon += 12;  /* Puts Feb last since it has leap day */
        year -= 1;
    }

    return ((((time64_t)
          (year/4 - year/100 + year/400 + 367*mon/12 + day) +
          year*365 - 719499
        )*24 + hour /* now have hours - midnight tomorrow handled here */
      )*60 + min /* now have minutes */
    )*60 + sec; /* finally seconds */
}

This can be written as the following formula:

Unix Epoch Formula

where the calendar date is offset by -2 months.

This post will attempt to explain each term in this equation represents, make note of assumptions and limitations of this function, and where I think these magic constants come from.

For the sake of simplicity and readability, all subsequent code snippets will be in Python.

The Obvious - Epoch Given Time & Days since 1st Jan 1970

Epoch given time and days since Jan 1 1970

The Less Obvious - Days since 1st Jan 1970 Given a Calendar Date

The conditional block of code references the beginning of a year to March 1st, as this is the day after February's leap-day. This offsets the reference date from 0000/01/01 to 0000/03/01 for the calculation of total leapdays, but this is seemingly accounted for in the magic number 719499.

This allows us to now calculate the number of leapdays in the Gregorian calendar since the year 0 (1 BC) using simple integer division:

Formula for number of leapyears since Gregorian 0000/01/01

which is equivalent to the term year/4 - year/100 + year/400 as integer division in C is floor division.

Since the formula for Gregorian leapdays has a reference date of 0000/01/01, perhaps 719499 is the number of days from 0000/01/01 to 1970/01/01 on the Gregorian calendar? If we declare a formula for days since Gregorian 0000/03/01, where years and months have already been adjusted as per the above if statement:

Formula for total days since Gregorian 0000/01/01

If we then adjust the unix epoch reference date (1970, 1, 1) -> (1969, 11, 1) with the aforementioned conditional code, and plug the result into the formula, we find the source of the first magic number:

Days between Gregorian reference and epoch reference Jan 1 1970

Therefore,

Number of days since Jan 1 1970

where year and month are offset by -2 months.

"Simplifying" the Original Function

Lets re-write the original function in python and split the equation up into the obvious and complicated parts into their own functions.

def days_since_gregorian_y0(year, month, day):
    if (0 >= (month := month-2)):
        month += 12
        year -= 1
    return (
        (year//4 - year//100 + year//400) +  # Leapdays since 0000/03/01
        (367 * month) // 12 + day +  # Year day from 03-01 reference (?)
        year * 365  # Days per year
    )


def mktime64(year, month, day, hour, minute, second):
    return (
        (
            (
                days_since_gregorian_y0(year, month, day) -
                days_since_gregorian_y0(1970, 1, 1)
            ) * 24 + hour
        ) * 60 + minute
    ) * 60 + second

Days and Months Being 1-indexed

Have established that the number if days since 1970/01/01 is a difference between two calculations for days since 0000/03/01. As long as the indexing is consistent between the two values being subtracted, the date and epoch reference, then the result should make sense. We've shown that the magic number 719499 uses 1-indexed month and day, so we therefore need to use 1-indexed month and day when calling mktime64.

Where does 367 come from?

If we compare an array of days for each month and the difference between consecutive values for the term (367 * m) // 12 where month is 1-indexed (1-12), we see the resulting array is similar to the months per day array. The only differences being a shift by one, and the number of days in February differing.

month_days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
unknown_term_values = [(367 * (m+1)) // 12 - (367 * m) // 12 for m in range(12)]

print(month_days[1:] + month_days[:1])
[28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31, 31]

print(unknown_term_values)
[30, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31, 31]

Conveniently, since month is indexed from 1, and there is always a difference being calculated between days since Gregorian 0000/03/01, this incorrect value for days in February always cancels out.

Summary

A few things seem to be in alignment here for such a simple formula:

  • The first day of the year for the sake of calculation is adjusted to be the March 1st, as it would be immediately preceded by a leapday if there is one that year.
  • The difference 367*m1//12 - 367*m0//12, correctly yields the number of days between any two months.
  • Day and month are 1-indexed not only for usability, but because the two prior points require day and month to be 1-indexed. The inaccuracy from the 367*m//12 formula for m=1, which is the month of February, is nullified because of the difference of 1-indexed dates.

The mktime64 docstring seems to reference Gauss as a source for part of this formula, but I haven't been able to track this down anywhere.

Either this simplicity may be a total coincidence, or perhaps whoever created the Roman calendar utilized the quotient of 367*m//12 to determine what months have 30 days, and what months have 31 days.

If anyone happens to know the source of this algorithm, and whether the 367*m//12 term was indeed used in deriving the number of days per month, I'd love to see a source!

Thank you :-)


Solution

  • Is the mktime64 implementation simple by pure coincidence or by design?

    It is by design: mktime64() takes advantage of an early form of the Roman Calendar that starts the years in March. Starting the year in March also explains why the months names September, October, November, December sound like 7th-month, 8th-month, 9th-month and 10th-month. Leap day concerns are then at the end of the year. This simplifies year-month-day-hour-minute-second to time number calculations.

    mktime64() is not well designed to handle out-of-range months and years like the 100th month of the year with sensible results.

    mktime64() does not handle overflow well like years above UINT_MAX/365 as year*365 overflows unsigned math. (IMO, this is a potential bug given the use of time64_t elsewhere. I'd expect year*(time64_t)365.) **

    It also takes advantage of the general days/month distribution of 30 and 31 days with a fraction of 367/12 or 307/12. With February at the end, with its 28,29 days, is irrelevant as the days per the month of February is not important - there is no month after February so the day length of February is not used. Since each month has at [30 ... 31] days, the month multiplier constant is likewise constrained to [30.0 ... 31.0]. To achieve month_index * month_multiplier_constant, but with only integer math and get the sequence [0,31,61,92, ...] (days since March 1st), we need to perform (m+offset1)*month_multiplier_constant_numerator / month_multiplier_constant_denominator + offset2. offset2 is swallowed up into the -719499 constant. offset1 is conveniently part of mon -= 2. The numerator/denominator, whose math quotient needs to be in the [30.0 ... 31.0] range is selected as 367/12 as if the total days in the year was 367 (a few extra in February - remember that total days in February are irrelevant) and that makes 367/12 per month.
    Rather than 367/12 (i.e. 30 and 7/12), I would have used 979/32 (i.e. 30 + 19/32) as division by a power-of-2 is often a tad more efficient.
    The key is that a multiplier of ~30.583 to ~30.599 is needed.

    The year/4 - year/100 + year/400 clearly implies the Gregorian Calendar, which was first used in 1582 and the Proleptic Gregorian calendar for earlier dates. This form of extra days per year calculation was a simple extension of the Julian calendar year/4.

    719499 is simply a day constant to shift the time64_t result to an epoch of Jan 1, 1970 for typical *nix use.


    ** Even if code does not need to handle years > UINTMAX/365:

    ((time64_t)(year/4 - year/100 + year/400 + 367*mon/12 + day) + year*365 - 719499)*24
    // better as
    ((year/4 - year/100 + year/400 + 367*mon/12 + day + year*365) - (time64_t)719499)*24
    

    since there is scant advantage in doing much of the addition with time64_t math versus just unsigned.

    When (time64_t) is signed, subtraction the 719499 useful as - (time64_t)719499 to shift the math to signed math.


    To map mon 1,2,3, ... 12 to [0,31,61,92, ... 337] (days since March 1st), multiply mon by 30 and a fraction. Floor it and then offset.

    Days_Since_March_1 = floor(mon*(30+fraction)) - 30;
    

    Days_Since_March_1(mon)

    Fraction is in range [.5833 ... 0.6000).