I'm brand new to the concept of database administration, so I have no basis for what to expect. I am working with approximately 100GB of data in the form of five different tables. Descriptions of the data, as well as the first few rows of each file, can be found here.
I'm currently just working with the flows
tables in an effort to gauge performance. Here is the results from \d flows
:
Table "public.flows"
Column | Type | Modifiers
------------+-------------------+-----------
time | real |
duration | real |
src_comp | character varying |
src_port | character varying |
dest_comp | character varying |
dest_port | character varying |
protocol | character varying |
pkt_count | real |
byte_count | real |
Indexes:
"flows_dest_comp_idx" btree (dest_comp)
"flows_dest_port_idx" btree (dest_port)
"flows_protocol_idx" btree (protocol)
"flows_src_comp_idx" btree (src_comp)
"flows_src_port_idx" btree (src_port)
Here is the results from EXPLAIN ANALYZE SELECT src_comp, COUNT(DISTINCT dest_comp) FROM flows GROUP BY src_comp;
, which I thought would be a relatively simple query:
GroupAggregate (cost=34749736.06..35724568.62 rows=200 width=64) (actual time=1292299.166..1621191.771 rows=11154 loops=1)
Group Key: src_comp
-> Sort (cost=34749736.06..35074679.58 rows=129977408 width=64) (actual time=1290923.435..1425515.812 rows=129977412 loops=1)
Sort Key: src_comp
Sort Method: external merge Disk: 2819360kB
-> Seq Scan on flows (cost=0.00..2572344.08 rows=129977408 width=64) (actual time=26.842..488541.987 rows=129977412 loops=1)
Planning time: 6.575 ms
Execution time: 1636290.138 ms
(8 rows)
If I'm interpreting this correctly (which I might not be since I'm new to PSQL), it's saying that my query would take almost 30 minutes to execute, which is much, much longer than I would expect. Even with ~130 million rows.
My computer is running with an 8th-gen i7 quad-core CPU, 16GBs of RAM, and a 2TB HDD (full specs can be found here).
My questions then are: 1) is this performance to be expected, and 2) is there anything I can do to speed it up, other than buying an external SSD?
1 - src_comp and dest_comp, which are used by the query, are both indexed. However, they are indexed independently. If you had an index of 'src_comp, dest_comp' then there is a possibility that the database could process this all via indexes, eliminating a full table scan.
2 - src_comp and dest_comp are character varying. That is NOT a good thing for indexed fields, unless necessary. What are these values really? Numbers? IP addresses? Computer network names? If there is a relatively finite number of these items and they can be identified as they are added to the database, change them to integers that are used as foreign keys into other tables. That will make a HUGE difference in this query. If they can't be stored that way, but they at least have a definite finite length - e.g., 15 characters for IPv4 addresses in dotted quad format - then set a maximum length for the fields, which should help some.