We are using MemSQL 5.1 for a web-analytics project. There are about 80M records in total, growing by roughly 0.5M records per day. A simple query, which aggregates how much data was received per domain, geo, and lang for a given day, takes about 5 seconds. I feel this time can be reduced, but I cannot find a way. Please point me in the right direction.
The table looks like this:
CREATE TABLE `domains` (
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`geo` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`lang` char(5) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`browser` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`os` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`device` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`domain` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`ref` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`blk_cnt` int(11) DEFAULT NULL,
KEY `date` (`date`,`geo`,`lang`,`domain`) /*!90619 USING CLUSTERED COLUMNSTORE */
/*!90618 , SHARD KEY () */
)
The query (with its EXPLAIN output) looks like this:
memsql> explain SELECT domain, geo, lang, avg(blk_cnt) as blk_cnt, count(*) as cnt FROM domains WHERE date BETWEEN '2016-07-31 0:00' AND '2016-08-01 0:00' GROUP BY domain, geo, lang ORDER BY blk_cnt ASC limit 40;
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| Project [r0.domain, r0.geo, r0.lang, $0 / CAST(COALESCE($1,0) AS SIGNED) AS blk_cnt, CAST(COALESCE($2,0) AS SIGNED) AS cnt] |
| Top limit:40 |
| GatherMerge [SUM(r0.s) / CAST(COALESCE(SUM(r0.c),0) AS SIGNED)] partitions:all est_rows:40 |
| Project [r0.domain, r0.geo, r0.lang, s / CAST(COALESCE(c,0) AS SIGNED) AS blk_cnt, CAST(COALESCE(cnt_1,0) AS SIGNED) AS cnt, s, c, cnt_1] est_rows:40 |
| TopSort limit:40 [SUM(r0.s) / CAST(COALESCE(SUM(r0.c),0) AS SIGNED)] |
| HashGroupBy [SUM(r0.s) AS s, SUM(r0.c) AS c, SUM(r0.cnt) AS cnt_1] groups:[r0.domain, r0.geo, r0.lang] |
| TableScan r0 storage:list stream:no |
| Repartition [domains.domain, domains.geo, domains.lang, cnt, s, c] AS r0 shard_key:[domain, geo, lang] est_rows:40 est_select_cost:144350216 |
| HashGroupBy [COUNT(*) AS cnt, SUM(domains.blk_cnt) AS s, COUNT(domains.blk_cnt) AS c] groups:[domains.domain, domains.geo, domains.lang] |
| Filter [domains.date >= '2016-07-31 0:00' AND domains.date <= '2016-08-01 0:00'] |
| ColumnStoreScan scan_js_data.domains, KEY date (date, geo, lang, domain) USING CLUSTERED COLUMNSTORE est_table_rows:72175108 est_filtered:18043777 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
Thank you very much!
Executing the GROUP BY is probably the most expensive part of this query. Using a shard key that matches the GROUP BY columns, i.e. SHARD KEY (domain, geo, lang), lets each partition finish the aggregation locally, which avoids the Repartition step visible in the plan and allows the GROUP BY to be executed faster.
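As far as I know, the shard key cannot be changed on an existing MemSQL table, so the usual approach is to create a new table with the desired shard key and copy the data over. A minimal sketch, reusing the column definitions from the question (the table name domains_new is just a placeholder):

CREATE TABLE `domains_new` (
  `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `geo` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `lang` char(5) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `browser` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `os` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `device` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `domain` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `ref` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `blk_cnt` int(11) DEFAULT NULL,
  KEY `date` (`date`,`geo`,`lang`,`domain`) USING CLUSTERED COLUMNSTORE,
  -- shard key matching the GROUP BY columns of the query
  SHARD KEY (`domain`,`geo`,`lang`)
);

-- copy the existing data, then swap the tables
INSERT INTO domains_new SELECT * FROM domains;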