Search code examples
postgresqldata-miningtext-miningdocument-classification

Cluster text documents in database


I do have 20.000 text files loaded in PostgreSQL database, one file in one row, all stored in table named docs with columns doc_id and doc_content.

I know that there is approximately 8 types of documents. Here are my questions:

  • How can I find these groups?
  • Are there some similarity, dissimilarity measures I can use?
  • Is there some implementation of longest common substring in PostgreSQL?
  • Are there some extensions for text mining in PostgreSQL? (I've found only Tsearch, but this seems to be last updated in 2007)

I can probably use some like '%%' or SIMILAR TO, but there might be better approach.


Solution

  • You should use full text search, which is part of PostgreSQL 9.x core (aka Tsearch2).

    For some kind of measure of longest common substring (or similarity if you will), you might be able to use levenshtein() function - part of fuzzystrmatch extension.