I've installed the latest (MonetDB 5 server v11v.27.5 "Jul2017-SP1") on Windows 2012 Server and I'm trying to query large table 1,4 billion rows in a reasonable time 2-3s.
Is this even possible with MonetDB? What could I do to improve the performance?
Detailed description of what I've done so far:
Created table:
CREATE TABLE t939ba ( id INT, xa INT, xb INT, ya INT, yb INT, a1 TINYINT, a2 TINYINT, a3 TINYINT, a4 TINYINT, a5 TINYINT, a6 TINYINT, a7 TINYINT, a8 TINYINT, a9 TINYINT);
Loaded the data:
COPY 1450000000 OFFSET 2 RECORDS INTO tbl FROM 'D:\\es_export\\file.csv'
USING DELIMITERS ',' NULL AS '' LOCKED;
Run the query:
SELECT COUNT(DISTINCT id) FROM tbl WHERE a1=22
AND xb>=143455 AND yb>=90911 AND xa<=143615 AND ya<=91007
AND a2 IN (2, 3, 4) AND a3 IN (0, 1, 2, 3, 4) AND a4 IN (0, 1, 2)
AND a5 IN (-1, 1, 2, 3, 4, 5, 6, 7) AND a6 IN (-1, 11, 12, 13, 14);
When I run the query for the 1st time it took (14m 52s), 2nd run of the same query took (3m 23s), 3rd consecutive run of the same query took (14s) and a slightly rearranged query took (3m 11s).
Tomas,
thanks for the plans and traces. I see that you used the revised query with range predicates instead of IN predicates, and that this query now runs in "a mere" ~39 s (compared to ~15 min) --- either because the range predicates are evaluated more efficiently than the IN predicates, or because, as Martin indicated, a later run of the query benefits from indexes that where automatically built by MonetDB when evaluating the first query, or because of both.
In any case, running a/each query (-version) more than once is a good idea to see the possible effect of automatically built indexes.
Further, I see that either you indeed have a 34-core machine, or your machine has "only" 2 GB RAM per core --- not too much given that you have a ~42 GB data set, where each column is ~1.5 GB to ~6 GB in size ...
Hence, the main reason for the query not running faster than ~39 s might be I/O activity due to "lack" of memory.
Best,
Stefan
ps:
You can check whether for this specific query, reducing (or even avoiding) multi-core parallelism helps to reduce I/O thrashing:
Try running your query after disabling MonetDB's "mitosis" optimizer using
set optimizer='no_mitosis_pipe';
You can re-enable full multi-core parallelism using
set optimzer='default_pipe';
Best,
Stefan