Search code examples
sqlpostgresqldatetimecaseintervals

SQL why does dateA - dateB <= '3 years' give a different result than dateA <= dateB + '3 years'


I was doing a MODE.com SQL practice question about date format.

The practice question is: Write a query that counts the number of companies acquired within 3 years, 5 years, and 10 years of being founded (in 3 separate columns). Include a column for total companies acquired as well. Group by category and limit to only rows with a founding date.

It uses two tables:

tutorial.crunchbase_companies_clean_date table, which includes information about all the companies, like company name, founded year, etc.

tutorial.crunchbase_acquisitions_clean_datetable, which includes the information about all the acquired companies, like acquired company name, acquired date, etc.

My code is:

SELECT companies.category_code,
       COUNT(CASE WHEN acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= '3 years' THEN 1 ELSE NULL END) AS less_than_3_years,
       COUNT(CASE WHEN acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= '5 years' THEN 1 ELSE NULL END) AS between_3_to_5_years,
       COUNT(CASE WHEN acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= '10 years' THEN 1 ELSE NULL END) AS within_10_years,
       COUNT(1) AS total
FROM tutorial.crunchbase_companies_clean_date companies
JOIN tutorial.crunchbase_acquisitions_clean_date acq 
ON companies.permalink = acq.company_permalink
WHERE companies.founded_at_clean IS NOT NULL
GROUP BY 1
ORDER BY total DESC

The result is: My result

The answer query is:

SELECT companies.category_code,
       COUNT(CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '3 years'
                       THEN 1 ELSE NULL END) AS acquired_3_yrs,
       COUNT(CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '5 years'
                       THEN 1 ELSE NULL END) AS acquired_5_yrs,
       COUNT(CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '10 years'
                       THEN 1 ELSE NULL END) AS acquired_10_yrs,
       COUNT(1) AS total
  FROM tutorial.crunchbase_companies_clean_date companies
  JOIN tutorial.crunchbase_acquisitions_clean_date acquisitions
    ON acquisitions.company_permalink = companies.permalink
 WHERE founded_at_clean IS NOT NULL
 GROUP BY 1
 ORDER BY 5 DESC

The result is: The answer result

You can see in the screenshots that the results are very similar, but some numbers are different.

The only difference I can see between my query and the answer is in the COUNT statements, but I don't really see the difference, for example, between: acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= '3 years' and acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '3 years'

I tried adding INTERVAL in my SELECT statement:

SELECT companies.category_code,
       COUNT(CASE WHEN acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= INTERVAL '3 years' THEN 1 ELSE NULL END) AS less_than_3_years,
       COUNT(CASE WHEN acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= INTERVAL '5 years' THEN 1 ELSE NULL END) AS between_3_to_5_years,
       COUNT(CASE WHEN acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= INTERVAL '10 years' THEN 1 ELSE NULL END) AS within_10_years,
       COUNT(1) AS total

and remove the INTERVAL from the answer query:

SELECT companies.category_code,
       COUNT(CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + '3 years'
                       THEN 1 ELSE NULL END) AS acquired_3_yrs,
       COUNT(CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + '5 years'
                       THEN 1 ELSE NULL END) AS acquired_5_yrs,
       COUNT(CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + '10 years'
                       THEN 1 ELSE NULL END) AS acquired_10_yrs,
       COUNT(1) AS total

But the results are the same.

I tried to know the result of just the difference between the acquired_date and founded_date, to see if the value can be compared with INTERVAL. The result is in days, which looks promising to me. The result

I try to give all the information for your consideration. Hope somebody could help. Thank you in advance!


Solution

  • My suggestion is to add/subtract the INTERVAL to/from one date/time and then compare with the other date/time. Don't subtract the date/times and then compare to a string literal. Your database seems to understand '3 YEARS' as 3 * 365 days, regardless of the actual number of days between someDateTime and someDateTime +/- '3 YEARS'. The actual number of days from year to year could be 365 or 366, depending on whether a leap year is crossed.

    Here's a simple example of comparing with a specific interval, which also requires we know whether and how many leap years were crossed.

    Fiddle

    The test case:

    WITH dates AS (
            SELECT '2021-01-01'::date AS xdate
         )
    SELECT xdate - (xdate - INTERVAL '1' YEAR) AS diff
         , xdate - (xdate - INTERVAL '1' YEAR) = '1 YEAR'   AS b1
         , xdate - (xdate - INTERVAL '1' YEAR) = '365 DAYS' AS b2
         , xdate - (xdate - INTERVAL '1' YEAR) = '366 DAYS' AS b3
      FROM dates
    ;
    
    -- AND --
    
    WITH dates AS (
            SELECT '2021-01-01'::date AS xdate
         )
    SELECT xdate - (xdate - INTERVAL '1' YEAR) AS diff
         , xdate - (xdate - INTERVAL '1' YEAR) = INTERVAL '1' YEAR   AS b1
         , xdate - (xdate - INTERVAL '1' YEAR) = INTERVAL '365 DAYS' AS b2
         , xdate - (xdate - INTERVAL '1' YEAR) = INTERVAL '366 DAYS' AS b3
      FROM dates
    ;
    

    Result:

    diff b1 b2 b3
    366 days f f t

    Fiddle

    WITH dates AS (
            SELECT '2021-01-01'::date AS xdate
         )
       , diff AS (
            SELECT xdate - (xdate - INTERVAL '1' YEAR) AS diff
              FROM dates
         )
    SELECT diff
         , CASE WHEN diff = (366*24*60*60 * INTERVAL '1' SECOND)
                THEN 1
            END AS compare1
         , 366*24*60*60 AS seconds
         , CASE WHEN diff = (366*24*60*60 * INTERVAL '1' SECOND)
                THEN 1
            END AS compare2
         , CASE WHEN diff = '31622400 SECONDS'
                THEN 1
            END AS compare3
      FROM diff
    ;
    

    The result:

    diff compare1 seconds compare2 compare3
    366 days 1 31622400 1 1

    Original response:

    The fiddle for PostgreSQL

    The behavior shown here (below) is similar to the posted behavior.

    The problem is the value generated isn't necessarily what you think.

    Here's a test case in postgresql which might be representative of your issue.

    Maybe this is related to leap year, where the number of days in a year isn't constant.

    So it's probably safer to compare the dates rather than assume some number of days, which is probably the assumption <= '3 years' makes.

    The test SQL:

    WITH test (acquired_at_cleaned, founded_at_clean, n) AS (
            SELECT current_date, current_date - INTERVAL '4' YEAR, 4 UNION
            SELECT current_date, current_date - INTERVAL '3' YEAR, 3 UNION
            SELECT current_date, current_date - INTERVAL '2' YEAR, 2 UNION
            SELECT current_date, current_date - INTERVAL '1' YEAR, 1
         )
       , cases AS (
            SELECT test.*
                 , CASE WHEN acquired_at_cleaned <= founded_at_clean::timestamp + INTERVAL '3' year
                        THEN 1 ELSE NULL
                    END AS acquired_3_yrs_case1
                 , CASE WHEN acquired_at_cleaned - founded_at_clean::timestamp <= '3 year'
                        THEN 1 ELSE NULL
                    END AS acquired_3_yrs_case2
                 , acquired_at_cleaned - founded_at_clean::timestamp AS x1
                 , acquired_at_cleaned - (n * INTERVAL '1' YEAR) AS x2
              FROM test
         )
    SELECT acquired_at_cleaned AS acquired
         , founded_at_clean    AS founded
         , n
         , acquired_3_yrs_case1 AS case1
         , acquired_3_yrs_case2 AS case2
         , x1, x2
      FROM cases
     ORDER BY founded_at_clean
    ;
    

    The result:

    acquired founded n case1 case2 x1 x2
    2021-12-25 2017-12-25 00:00:00 4 null null 1461 days 2017-12-26 00:00:00
    2021-12-25 2018-12-25 00:00:00 3 1 null 1096 days 2018-12-26 00:00:00
    2021-12-25 2019-12-25 00:00:00 2 1 1 731 days 2019-12-26 00:00:00
    2021-12-25 2020-12-25 00:00:00 1 1 1 365 days 2020-12-26 00:00:00

    Interesting result.