Search code examples
sqlclickhouse

Can't properly group 14 dates by 7 days in Clickhouse


I have 14 dates FROM 2023-07-22 TO 2023-08-04. I want to Group 14 days by 7 days and as a result I want 2 dates with lates days of these groups.

My query

SELECT toStartOfInterval(date, toIntervalDay(7)) AS week
FROM
(
  SELECT arrayJoin([toDate('2023-07-22'), toDate('2023-08-04')]) AS date
  ORDER BY date ASC WITH FILL STEP 1
)
GROUP BY week
ORDER BY week ASC

Results:

┌────────────────week─┐
│ 2023-07-20 00:00:00 │
│ 2023-07-27 00:00:00 │
│ 2023-08-03 00:00:00 │
└─────────────────────┘

Why there is 3 rows instead of 2? From where does this date 2023-07-20 00:00:00 comes from? And what is wrong with dates?

I expected these values:

2023-07-29 00:00:00
2023-08-04 00:00:00

UPD

Link to Fiddle - https://fiddle.clickhouse.com/87410f62-3031-4a55-9520-bfc5ca11e448

(From Clickhouse Telegram support)

UPD.2

Right query for my situation:

with
    toDate('2023-07-22') as DATE_FROM,
    toDate('2023-08-04') as DATE_TO,
    7 as DAYS_LENGTH,
    DATE_FROM % DAYS_LENGTH as DAYS_SHIFT
select
    date - (date - DAYS_SHIFT) % DAYS_LENGTH as days_group
FROM
(
    SELECT arrayJoin([DATE_FROM, DATE_TO]) AS date
    ORDER BY date ASC WITH FILL STEP 1
)
GROUP BY days_group
order by days_group
format PrettyCompactMonoBlock
;

(Thanks for query example to Sergey Churin From Clickhouse Telegram support)


Solution

  • With toStartOfInterval(date, toIntervalDay(7)) you want the beginning of a seven-day interval in which the date resides. Who defines that interval? For the day 2023-07-22 it could be any of the seven intervals from 2023-07-16 to 2023-07-22 to 2023-07-22 to 2023-07-28.

    The DBMS has decided that seven-day intervals always start with a Thursday. This may be caused by using a UNIX date representation internally where the first valid day is 1970-01-01. A Thursday.

    If you want to put your consecutive days into groups of seven, then just number the rows 0 to n and divide by 7 to get a group number:

    with
      7 as DAYS_LENGTH
    SELECT max(date)
    FROM
    (  
      SELECT
        date,
        TRUNC((ROW_NUMBER() OVER(ORDER BY date) - 1) / DAYS_LENGTH) AS grp
      FROM
      (
        SELECT arrayJoin([toDate('2023-07-22'), toDate('2023-08-04')]) AS date
        ORDER BY date ASC WITH FILL STEP 1
      )
    )
    GROUP BY grp
    ORDER BY max(date);
    

    Demo: https://fiddle.clickhouse.com/ccb2ca14-104d-4968-bea9-138a075cbd4f