I was doing a MODE.com SQL practice question about date format.
The practice question is: Write a query that counts the number of companies acquired within 3 years, 5 years, and 10 years of being founded (in 3 separate columns). Include a column for total companies acquired as well. Group by category and limit to only rows with a founding date.
It uses two tables:
tutorial.crunchbase_companies_clean_date
table, which includes information about all the companies, like company name, founded year, etc.
tutorial.crunchbase_acquisitions_clean_date
table, which includes the information about all the acquired companies, like acquired company name, acquired date, etc.
My code is:
SELECT companies.category_code,
COUNT(CASE WHEN acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= '3 years' THEN 1 ELSE NULL END) AS less_than_3_years,
COUNT(CASE WHEN acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= '5 years' THEN 1 ELSE NULL END) AS between_3_to_5_years,
COUNT(CASE WHEN acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= '10 years' THEN 1 ELSE NULL END) AS within_10_years,
COUNT(1) AS total
FROM tutorial.crunchbase_companies_clean_date companies
JOIN tutorial.crunchbase_acquisitions_clean_date acq
ON companies.permalink = acq.company_permalink
WHERE companies.founded_at_clean IS NOT NULL
GROUP BY 1
ORDER BY total DESC
The result is: My result
The answer query is:
SELECT companies.category_code,
COUNT(CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '3 years'
THEN 1 ELSE NULL END) AS acquired_3_yrs,
COUNT(CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '5 years'
THEN 1 ELSE NULL END) AS acquired_5_yrs,
COUNT(CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '10 years'
THEN 1 ELSE NULL END) AS acquired_10_yrs,
COUNT(1) AS total
FROM tutorial.crunchbase_companies_clean_date companies
JOIN tutorial.crunchbase_acquisitions_clean_date acquisitions
ON acquisitions.company_permalink = companies.permalink
WHERE founded_at_clean IS NOT NULL
GROUP BY 1
ORDER BY 5 DESC
The result is: The answer result
You can see in the screenshots that the results are very similar, but some numbers are different.
The only difference I can see between my query and the answer is in the COUNT statements, but I don't really see the difference, for example, between: acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= '3 years'
and acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '3 years'
I tried adding INTERVAL
in my SELECT statement:
SELECT companies.category_code,
COUNT(CASE WHEN acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= INTERVAL '3 years' THEN 1 ELSE NULL END) AS less_than_3_years,
COUNT(CASE WHEN acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= INTERVAL '5 years' THEN 1 ELSE NULL END) AS between_3_to_5_years,
COUNT(CASE WHEN acq.acquired_at_cleaned - companies.founded_at_clean:: timestamp <= INTERVAL '10 years' THEN 1 ELSE NULL END) AS within_10_years,
COUNT(1) AS total
and remove the INTERVAL
from the answer query:
SELECT companies.category_code,
COUNT(CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + '3 years'
THEN 1 ELSE NULL END) AS acquired_3_yrs,
COUNT(CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + '5 years'
THEN 1 ELSE NULL END) AS acquired_5_yrs,
COUNT(CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + '10 years'
THEN 1 ELSE NULL END) AS acquired_10_yrs,
COUNT(1) AS total
But the results are the same.
I tried to know the result of just the difference between the acquired_date and founded_date, to see if the value can be compared with INTERVAL
. The result is in days, which looks promising to me.
The result
I try to give all the information for your consideration. Hope somebody could help. Thank you in advance!
My suggestion is to add/subtract the INTERVAL to/from one date/time and then compare with the other date/time. Don't subtract the date/times and then compare to a string literal. Your database seems to understand '3 YEARS'
as 3 * 365 days
, regardless of the actual number of days between someDateTime
and someDateTime +/- '3 YEARS'
. The actual number of days from year to year could be 365 or 366, depending on whether a leap year is crossed.
Here's a simple example of comparing with a specific interval, which also requires we know whether and how many leap years were crossed.
The test case:
WITH dates AS (
SELECT '2021-01-01'::date AS xdate
)
SELECT xdate - (xdate - INTERVAL '1' YEAR) AS diff
, xdate - (xdate - INTERVAL '1' YEAR) = '1 YEAR' AS b1
, xdate - (xdate - INTERVAL '1' YEAR) = '365 DAYS' AS b2
, xdate - (xdate - INTERVAL '1' YEAR) = '366 DAYS' AS b3
FROM dates
;
-- AND --
WITH dates AS (
SELECT '2021-01-01'::date AS xdate
)
SELECT xdate - (xdate - INTERVAL '1' YEAR) AS diff
, xdate - (xdate - INTERVAL '1' YEAR) = INTERVAL '1' YEAR AS b1
, xdate - (xdate - INTERVAL '1' YEAR) = INTERVAL '365 DAYS' AS b2
, xdate - (xdate - INTERVAL '1' YEAR) = INTERVAL '366 DAYS' AS b3
FROM dates
;
Result:
diff | b1 | b2 | b3 |
---|---|---|---|
366 days | f | f | t |
WITH dates AS (
SELECT '2021-01-01'::date AS xdate
)
, diff AS (
SELECT xdate - (xdate - INTERVAL '1' YEAR) AS diff
FROM dates
)
SELECT diff
, CASE WHEN diff = (366*24*60*60 * INTERVAL '1' SECOND)
THEN 1
END AS compare1
, 366*24*60*60 AS seconds
, CASE WHEN diff = (366*24*60*60 * INTERVAL '1' SECOND)
THEN 1
END AS compare2
, CASE WHEN diff = '31622400 SECONDS'
THEN 1
END AS compare3
FROM diff
;
The result:
diff | compare1 | seconds | compare2 | compare3 |
---|---|---|---|---|
366 days | 1 | 31622400 | 1 | 1 |
Original response:
The behavior shown here (below) is similar to the posted behavior.
The problem is the value generated isn't necessarily what you think.
Here's a test case in postgresql which might be representative of your issue.
Maybe this is related to leap year
, where the number of days in a year isn't constant.
So it's probably safer to compare the dates rather than assume some number of days, which is probably the assumption <= '3 years'
makes.
The test SQL:
WITH test (acquired_at_cleaned, founded_at_clean, n) AS (
SELECT current_date, current_date - INTERVAL '4' YEAR, 4 UNION
SELECT current_date, current_date - INTERVAL '3' YEAR, 3 UNION
SELECT current_date, current_date - INTERVAL '2' YEAR, 2 UNION
SELECT current_date, current_date - INTERVAL '1' YEAR, 1
)
, cases AS (
SELECT test.*
, CASE WHEN acquired_at_cleaned <= founded_at_clean::timestamp + INTERVAL '3' year
THEN 1 ELSE NULL
END AS acquired_3_yrs_case1
, CASE WHEN acquired_at_cleaned - founded_at_clean::timestamp <= '3 year'
THEN 1 ELSE NULL
END AS acquired_3_yrs_case2
, acquired_at_cleaned - founded_at_clean::timestamp AS x1
, acquired_at_cleaned - (n * INTERVAL '1' YEAR) AS x2
FROM test
)
SELECT acquired_at_cleaned AS acquired
, founded_at_clean AS founded
, n
, acquired_3_yrs_case1 AS case1
, acquired_3_yrs_case2 AS case2
, x1, x2
FROM cases
ORDER BY founded_at_clean
;
The result:
acquired | founded | n | case1 | case2 | x1 | x2 |
---|---|---|---|---|---|---|
2021-12-25 | 2017-12-25 00:00:00 | 4 | null | null | 1461 days | 2017-12-26 00:00:00 |
2021-12-25 | 2018-12-25 00:00:00 | 3 | 1 | null | 1096 days | 2018-12-26 00:00:00 |
2021-12-25 | 2019-12-25 00:00:00 | 2 | 1 | 1 | 731 days | 2019-12-26 00:00:00 |
2021-12-25 | 2020-12-25 00:00:00 | 1 | 1 | 1 | 365 days | 2020-12-26 00:00:00 |
Interesting result.