I have an SQL query to the Oracle database (The query links the elements to the clusters related by the clusters by having the maximum average connection to each element of the cluster. The relationship of two elements is defined by number of links they have in common. We calculate the relation of the given element to the each element of the cluster, then take the relation of the element to the cluster as average of relations of this elements to all elements of the cluster, then pick the cluster to which our element has maximal relation and add the element to the cluster )
INSERT INTO t_clusters (transaction_id,
CLUSTER_ID,
word_id,
connection)
SELECT 72,
CLUSTER_ID,
word_id,
avg_cnt
FROM ( SELECT word_id, CLUSTER_ID, AVG (cnt) avg_cnt
FROM ( SELECT a.word_id,
b.word_id bword,
c.CLUSTER_ID,
COUNT (*) cnt
FROM t_semantic a,
t_words a1,
t_semantic b,
t_clusters c
WHERE a.transaction_id = 72
AND a.word_id = a1.word_id
AND a1.checked = 0
AND CLUSTER_ID <> 0
AND a.link_id = b.link_id
AND b.word_id = c.word_id
GROUP BY a.word_id, b.word_id, c.CLUSTER_ID)
GROUP BY word_id, CLUSTER_ID
HAVING AVG (cnt) >= 3) zz
WHERE avg_cnt = (SELECT MAX (avg_cnt)
FROM ( SELECT AVG (cnt) avg_cnt, word_id
FROM ( SELECT a.word_id,
b.word_id bword,
c.CLUSTER_ID,
COUNT (*) cnt
FROM t_semantic a,
t_words a1,
t_semantic b,
t_clusters c
WHERE a.transaction_id = 72
AND a.word_id = a1.word_id
AND a1.checked = 0
AND CLUSTER_ID <> 0
AND a.link_id = B.link_id
AND b.word_id = c.word_id
GROUP BY a.word_id,
b.word_id,
c.CLUSTER_ID)
GROUP BY word_id, CLUSTER_ID
HAVING AVG (cnt) > 3) zzy
WHERE zzy.word_id = zz.word_id)
It executing already 23 hours
The number of records in the table of relations (t_semantic) is 532125 The number of records in the table of clusters (t_clusters) where stored elements, already connected to clusters is 23014 The number of records in the table with elements not connected to clusters yet (t_words) is 36111
Fields word_id, cluster_id, link_id indexed in all table. Is it any way to improve the perfomance of this query and why it's so slow?
Plan
INSERT STATEMENT ALL_ROWSCost: 9,234
28 FILTER
10 HASH GROUP BY Cost: 9,234 Bytes: 46,000 Cardinality: 2,000
9 VIEW Cost: 9,234 Bytes: 8,599,493 Cardinality: 373,891
8 SORT GROUP BY Cost: 9,234 Bytes: 15,329,531 Cardinality: 373,891
7 HASH JOIN Cost: 5,215 Bytes: 15,329,531 Cardinality: 373,891
1 TABLE ACCESS FULL TABLE DKIM.T_WORDS Cost: 86 Bytes: 286,568 Cardinality: 35,821
6 HASH JOIN Cost: 5,117 Bytes: 20,274,408 Cardinality: 614,376
4 HASH JOIN Cost: 2,245 Bytes: 4,108,540 Cardinality: 205,427
2 INDEX FAST FULL SCAN INDEX (UNIQUE) DKIM.UK_CLUSTER Cost: 333 Bytes: 229,140 Cardinality: 22,914
3 TABLE ACCESS FULL TABLE DKIM.T_SEMANTIC Cost: 1,903 Bytes: 5,277,070 Cardinality: 527,707
5 TABLE ACCESS FULL TABLE DKIM.T_SEMANTIC Cost: 1,909 Bytes: 6,859,632 Cardinality: 527,664
27 SORT AGGREGATE Bytes: 26 Cardinality: 1
26 VIEW Cost: 18 Bytes: 26 Cardinality: 1
25 FILTER
24 SORT GROUP BY Cost: 18 Bytes: 23 Cardinality: 1
23 VIEW Cost: 18 Bytes: 23 Cardinality: 1
22 SORT GROUP BY Cost: 18 Bytes: 41 Cardinality: 1
21 TABLE ACCESS BY INDEX ROWID TABLE DKIM.T_CLUSTERS Cost: 4 Bytes: 10 Cardinality: 1
20 NESTED LOOPS Cost: 17 Bytes: 41 Cardinality: 1
18 NESTED LOOPS Cost: 9 Bytes: 93 Cardinality: 3
15 NESTED LOOPS Cost: 5 Bytes: 21 Cardinality: 1
12 TABLE ACCESS BY INDEX ROWID TABLE DKIM.T_WORDS Cost: 2 Bytes: 8 Cardinality: 1
11 INDEX UNIQUE SCAN INDEX (UNIQUE) DKIM.T_WORDS_PK Cost: 1 Cardinality: 1
14 TABLE ACCESS BY INDEX ROWID TABLE DKIM.T_SEMANTIC Cost: 3 Bytes: 13 Cardinality: 1
13 INDEX RANGE SCAN INDEX DKIM.IND_SEMANTIC_WORD_ID Cost: 2 Cardinality: 1
17 TABLE ACCESS BY INDEX ROWID TABLE DKIM.T_SEMANTIC Cost: 4 Bytes: 30 Cardinality: 3
16 INDEX RANGE SCAN INDEX DKIM.IND_SEMANTIC_LINK_ID Cost: 2 Cardinality: 3
19 INDEX RANGE SCAN INDEX DKIM.CLUSTERS_WORD_ID Cost: 2 Cardinality: 1
Using analytic functions to determine the max avg_cnt might help speed things up, e.g.:
insert into t_clusters (transaction_id,
cluster_id,
word_id,
connection)
select 72,
cluster_id,
word_id,
avg_cnt
from (select word_id,
cluster_id,
avg(cnt) avg_cnt,
max(avg(cnt)) over (partition by word_id) max_avg_cnt
from (select a.word_id,
b.word_id bword,
c.cluster_id,
count (*) cnt
from t_semantic a,
t_words a1,
t_semantic b,
t_clusters c
where a.transaction_id = 72
and a.word_id = a1.word_id
and a1.checked = 0
and cluster_id <> 0
and a.link_id = b.link_id
and b.word_id = c.word_id
group by a.word_id, b.word_id, c.cluster_id)
group by word_id, cluster_id
having avg(cnt) >= 3)
where avg_cnt = max_avg_cnt;