Say I have a huge set of documents stored in a relational table with the following columns:
ID (unique identifier)
Title (255 characters)
Description (5000 characters)
Category (predefined metadata)
Additional Notes (1000 characters)
I would like to add one or more tags to each row in the document table. Here a tag is a word or a group of words that tells readers what a document is about.
Are there any data-mining, text-mining, or machine-learning techniques that would help me find the most appropriate tags for a given document without human intervention?
One simple approach: for a given document, compute the TF-IDF score of every word and take the top-N words as tags (or cut off candidates at some threshold). In your case it is also reasonable to apply empirical boosting coefficients to words that appear in the Title and Category fields.
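
A minimal sketch of that idea in plain Python, in case it helps make it concrete. The dictionary keys (title, description, category, notes), the weights in FIELD_WEIGHTS, and the function names are all assumptions made for illustration; the boosting coefficients in particular would need to be tuned on your own data. It handles single words only, with no stop-word list or stemming.

```python
import math
import re
from collections import Counter

# Hypothetical boosting coefficients per field; tune these empirically.
FIELD_WEIGHTS = {"title": 3.0, "category": 2.0, "description": 1.0, "notes": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_tf(doc):
    """Term counts for one document, where each occurrence is weighted by its field."""
    tf = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for token in tokenize(doc.get(field, "")):
            tf[token] += weight
    return tf


def suggest_tags(docs, top_n=3):
    """Return the top_n words per document, ranked by boosted TF-IDF."""
    tfs = [weighted_tf(d) for d in docs]

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())

    n = len(docs)
    result = []
    for tf in tfs:
        # Words that occur in every document get idf = log(1) = 0.
        scores = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score; break ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_n])
    return result


# Toy rows standing in for the document table (made-up data).
docs = [
    {"title": "Python pandas tutorial",
     "description": "A tutorial on data frames with pandas",
     "category": "programming"},
    {"title": "Baking sourdough bread",
     "description": "A guide on baking bread at home",
     "category": "cooking"},
    {"title": "Rust ownership guide",
     "description": "A guide on ownership and borrowing",
     "category": "programming"},
]

tags = suggest_tags(docs)
for doc, doc_tags in zip(docs, tags):
    print(doc["title"], "->", doc_tags)
```

On a real table you would probably also filter stop words and consider bigrams so that multi-word tags can surface; the threshold variant simply replaces the top_n slice with a filter on the score.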