I was trying to write an optimized SQL query to count a specific XML node in clob type column, for now I'm using a REGEXP_COUNT to browse through all rows and all XMLs, but this is terribly slow.
Like this:
SELECT sum(REGEXP_COUNT(clobtype_column_withXMLs, '/Node_Closing_tag', 1)) as 'Node_number' FROM MY_TABLE;
This works, but with hundreds of millions of tags it takes literally ages.
I want to bypass the XML structure and shorten the count time. I have found some things, but most of them involve changing the column type to xml and use xquery and other DML functions.
In my case I can't modify the table, I only need the number, how many times the specified node occurs in the column.
My table structure:
not an XML Table, various column types and a CLOB type column
, in which every row contains a different sized XML (hundreds of millions of characters).
The XML has pretty complex structure, the one node I'm searching for is in the middle:
1/2/3/4/5/I'm_searching_for_this/contains_multiple_nodes
Is there any fast way to count nodes in a structure like this, without modifying anything in the table? With fast I mean, count 500 million nodes in a couple of hours maximum. (text search takes 24h+)
The replace and measure length difference trick shoudl work for you with the same result as the REGEXP_COUNT
(not necessary on the same precision when you parse XML and counts elements, which should have the ultimate precision).
The advantage of the REPLACE
approach is that it will be the fastest implementation.
Simple
replace each occurence of the searched substring in the original string with NULL
calculate the difference of length of the original string and the replaced one
divide the result with the length of the substring to get the result count
.
select
sum(REGEXP_COUNT(clobtype_column_withXMLs, '/Node_Closing_tag', 1)) cnt1,
sum( (length(clobtype_column_withXMLs) - length(replace(clobtype_column_withXMLs,'/Node_Closing_tag',null))) / length('/Node_Closing_tag')) cnt2
from tst
I'd recomend to perform some test using different search substrings
/tag
tag/
<tag>
</tag>
<tag/>
and compare the results to see if this imethod is reliable.