I've been tasked with estimating the marginal cost of joining an additional table in reporting queries written in SQL against a PostgreSQL database. I was given a test program to measure the query's performance, but there appears to be no statistically significant difference between run times whether or not the joined table is populated. How can I write a better test that illustrates the difference in query times between the two scenarios?
In each case, the SQL is the same, joining table A with table B. The only difference is whether or not table B contains any data.
Table A has these columns:
Column | Type
--------------+-----------------------------
sid | bigint
cluster | text
sn | text
tag_id | integer
src_ip | text
dst_ip | text
dst_port | integer
protocol | text
src_intf | text
dst_intf | text
disp | smallint
rcvd_bytes | bigint
sent_bytes | bigint
duration | integer
count | integer
start_time | timestamp without time zone
policy_id | text
src_user | text
dst_domain | text
app_id | text
signature_id | text
deny_type_id | text
reputation | text
wb_cat_id | text
alarm_name | text
virus | text
sender | text
recipients | text
host | text
dlp_rule_id | text
spam_type | text
spam_action | text
Table B has these columns:
Column | Type
--------------+-----------------------------
appliance_id | integer
ip | inet
fqdn | text
resolve_time | timestamp without time zone
expire_time | timestamp without time zone
The two tables are joined by an IPv4 address of type inet.
Test data populates tables A and B with just over 500 rows of data each.
With table B populated, the test program that runs the query takes an average of 18.216 seconds of total time, with a standard deviation of 1.143 seconds across runs.
With table B empty, it takes an average of 18.523 seconds, with a standard deviation of 1.928 seconds.
The sample size in each case is six runs. I suspect I'll need a much larger sample size, but I'm not certain what an appropriate one would be.
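To sanity-check the six-run numbers, a Welch's t-test on the summary statistics above can be computed with just the Python standard library (a sketch; the means, standard deviations, and n = 6 are taken from the runs described above, and the power calculation uses the conventional 0.05 significance level and 80% power):

```python
import math
from statistics import NormalDist

# Summary statistics from the six-run tests above
m_pop, s_pop, n_pop = 18.216, 1.143, 6        # table B populated
m_empty, s_empty, n_empty = 18.523, 1.928, 6  # table B empty

# Welch's t statistic (does not assume equal variances)
se = math.sqrt(s_pop**2 / n_pop + s_empty**2 / n_empty)
t = (m_empty - m_pop) / se
print(f"t = {t:.3f}")  # far below the ~2.2 critical value, so not significant

# Rough per-group sample size for a two-sided test at alpha = 0.05
# with 80% power to detect the observed 0.307 s difference
z_alpha = NormalDist().inv_cdf(0.975)
z_beta = NormalDist().inv_cdf(0.80)
sigma = math.sqrt((s_pop**2 + s_empty**2) / 2)  # pooled spread estimate
delta = abs(m_empty - m_pop)
n_needed = math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)
print(f"approx. runs per group needed: {n_needed}")
```

With these numbers the t statistic comes out around 0.34, nowhere near significance, and the power calculation suggests on the order of a few hundred runs per group would be needed to resolve a difference this small against this much run-to-run noise.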
I wrote a simple query using a LEFT JOIN:
SELECT *
FROM a
LEFT JOIN b
  ON a.src_ip::inet = b.ip
 AND b.resolve_time IS NOT NULL AND b.resolve_time <= now()
 AND b.expire_time IS NOT NULL AND now() < b.expire_time;
I then ran 10 tests of 100 iterations (queries) each and timed the results of each test run.
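For reference, the timing loop is essentially the following (a sketch: `run_query` here is a stub standing in for whatever executes the SELECT against PostgreSQL, e.g. a psycopg2 `cursor.execute()`/`fetchall()` pair):

```python
import time
import statistics

def run_query():
    # Stand-in for executing the SELECT against PostgreSQL;
    # the real harness issues the LEFT JOIN query shown above.
    time.sleep(0.001)

def run_test(iterations=100):
    """Time one test: `iterations` back-to-back query executions."""
    start = time.perf_counter()
    for _ in range(iterations):
        run_query()
    return time.perf_counter() - start

# 10 tests of 100 iterations each, as described above
times = [run_test(100) for _ in range(10)]
print(f"mean = {statistics.mean(times):.3f} s, "
      f"stdev = {statistics.stdev(times):.3f} s")
```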
Here are the results:
By averaging the run times and taking the ratio of the average with table B populated to the average with table B empty, I computed a 21.38% overhead for the query with table B populated over the query with it empty.