python, machine-learning, nltk, text-classification, naivebayes

How to build a text classifier for words?


I am trying to classify a large number of words into 5 categories. Examples of classes and strings for each class include:

invoice-Number : "inv123","in12","123"
invoice-Date   : "22/09/1994","22-Mon-16"
vendor-Name    : "samplevendorname"
email          : "[email protected]"
net-amount     : "1234.56"

Any pointers on how to achieve this in Python would be much appreciated.

EDIT 1: I'm looking for a machine learning approach, because the number of classes will grow and the data in each class will vary, so regex is not feasible.


Solution

  • You can start from the basic idea of BoW (bag of words) but adapt it to BoC (bag of characters): use a tokenizer that doesn't remove any characters and build a dictionary of character n-grams of length 1 to 4.

    After that, you can represent any word as a vector whose entries are the n-gram counts, binary presence/absence flags, or tf-idf weights.

    Then build your model and pass it the word vectors to learn from. You can study how the n-grams are distributed across the labels and discard the ones that only add noise to the dataset.

    I hope this helps as a starting point.
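The steps above could be sketched roughly as follows, using only the standard library. This is a minimal illustration under stated assumptions, not a definitive implementation: the helper `char_ngrams` and the class `CharNgramNaiveBayes` are hypothetical names, the model is a plain multinomial Naive Bayes over raw n-gram counts (one of the three representations mentioned, chosen because the question is tagged naivebayes), and the training data is just the handful of strings from the question. The email example is left out because its text is obscured in the source.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=1, n_max=4):
    """Return every character n-gram of length n_min..n_max; no character is dropped."""
    return [
        word[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(word) - n + 1)
    ]


class CharNgramNaiveBayes:
    """Hypothetical helper: multinomial Naive Bayes over n-gram counts, Laplace-smoothed."""

    def fit(self, words, labels):
        self.class_counts = Counter(labels)        # number of training words per class
        self.ngram_counts = defaultdict(Counter)   # n-gram counts per class
        for word, label in zip(words, labels):
            self.ngram_counts[label].update(char_ngrams(word))
        # The "dictionary" of n-grams: everything seen in any class.
        self.vocab = {g for counts in self.ngram_counts.values() for g in counts}
        return self

    def predict(self, word):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n_words in self.class_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_words / total)      # log prior of the class
            for gram in char_ngrams(word):
                if gram in self.vocab:             # n-grams unseen in training are skipped
                    score += math.log((counts[gram] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training set taken from the question; a real model needs far more examples.
train = [
    ("inv123", "invoice-Number"), ("in12", "invoice-Number"), ("123", "invoice-Number"),
    ("22/09/1994", "invoice-Date"), ("22-Mon-16", "invoice-Date"),
    ("samplevendorname", "vendor-Name"),
    ("1234.56", "net-amount"),
]
words, labels = zip(*train)
model = CharNgramNaiveBayes().fit(words, labels)

print(char_ngrams("in12"))      # all 1- to 4-character n-grams of "in12"
print(model.predict("inv77"))   # classify a string that was not in the training set
```

If scikit-learn is available, `CountVectorizer(analyzer="char", ngram_range=(1, 4), lowercase=False)` or `TfidfVectorizer` with the same arguments, followed by `MultinomialNB`, should cover the same steps with less code, and passing `binary=True` to `CountVectorizer` gives the presence/absence variant.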