sql oracle-database plsql duplicates clob

Count CLOB Duplicates in a large Oracle Table

I have an Oracle database table LOG_MESSAGES with a CLOB column called MESSAGE. Some of the rows contain the same MESSAGE.

For each MESSAGE which has at least a duplicate, I'd like to know the number of duplicates. Quite a number of these CLOBs are huge (> 100 kB), so converting to VARCHAR2 is out of question. Since many traditional methods such as GROUP BY do not work with CLOB, could someone please enlighten me?

For information, the table is very large (around 1 TB). So an optimised method would very much be appreciated.

Thank you in advance for your help.

Solution

I think this question gets asked a lot but unfortunately there doesn't seem to be a perfect way of doing this. There are ways that work just fine though.

Search for "clob group by" or "clob distinct" and you will see several hits just on this website.

One way would be to write a PL/SQL script that does a DBMS_LOB.COMPARE between all clobs in the table but the efficiency would probably be in the order of O(n^2) which would make it really slow for your purpose.

Another approach that is well accepted is to take a hash value of the clob using dbms_crypto (i think that allows hashing on clobs) and then group by on the hash values. There is a possibility of hash collision, but the probability is minute. I read somewhere around 2^80 (number might be wrong though). This won't be as slow as the first approach but calculating a hash would also take non-negligible time.

I would suggest try the hash approach first and if that seems too slow, look for alternatives.