I have 20,000 text files loaded into a PostgreSQL database, one file per row, all stored in a table named `docs` with columns `doc_id` and `doc_content`.
I know that there are approximately 8 types of documents. Here are my questions:
I can probably use `LIKE '%%'` or `SIMILAR TO`, but there might be a better approach.
You should use full text search, which has been part of the PostgreSQL core since version 8.3 (before that it was the Tsearch2 contrib module).
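As a rough sketch of what that could look like against the `docs` table from the question: the `'english'` text search configuration, the index name `docs_fts_idx`, and the search terms are all assumptions to be replaced with whatever fits your data.

```sql
-- Assumes the table from the question: docs(doc_id, doc_content).
-- 'english', docs_fts_idx and the search words are placeholders.

-- Optional: a GIN expression index so searches do not re-parse every row.
CREATE INDEX docs_fts_idx
    ON docs
    USING gin (to_tsvector('english', doc_content));

-- Find documents that contain both words, best matches first.
SELECT doc_id,
       ts_rank(to_tsvector('english', doc_content), query) AS rank
FROM docs,
     to_tsquery('english', 'invoice & payment') AS query
WHERE to_tsvector('english', doc_content) @@ query
ORDER BY rank DESC
LIMIT 20;
```

The `WHERE` clause repeats the exact expression used in the index, which is what lets the planner use it. If each of the 8 document types has a few characteristic words, one such query per type may be enough to label most rows.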
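For a rough measure of similarity between strings, you might be able to use the `levenshtein()` function, part of the `fuzzystrmatch` extension. Note that it computes edit distance rather than the longest common substring, and it only accepts strings of up to 255 characters, so it fits short excerpts better than whole documents.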
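A minimal sketch of the `levenshtein()` idea, assuming `doc_id` is an integer and that the first 255 characters of a document (for example, a header) say something about its type. Both are assumptions, and the sample ids are hypothetical.

```sql
-- Requires the extension once per database (CREATE EXTENSION needs 9.1 or later).
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Compare the leading 255 characters of a few sample documents
-- against every other document; a smaller distance means more alike.
-- left(..., 255) keeps both arguments within levenshtein()'s length limit.
SELECT a.doc_id AS doc_a,
       b.doc_id AS doc_b,
       levenshtein(left(a.doc_content, 255),
                   left(b.doc_content, 255)) AS distance
FROM docs a
JOIN docs b ON a.doc_id <> b.doc_id
WHERE a.doc_id IN (1, 2, 3)   -- hypothetical sample ids
ORDER BY distance
LIMIT 20;
```

Restricting `a` to a handful of rows matters here: comparing all 20,000 documents pairwise would mean roughly 200 million distance computations.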