I am trying to classify a large number of words into 5 categories. Examples of classes and strings for each class include:
invoice-Number : "inv123","in12","123"
invoice-Date : "22/09/1994","22-Mon-16"
vendor-Name : "samplevendorname"
email : "[email protected]"
net-amount : "1234.56"
Any pointers on how to achieve this in Python are very much appreciated.
EDIT 1: I'm looking for a machine learning approach, because the number of classes will grow and the data in each class will vary, so regex is not feasible.
You can start with the basic idea of BoW (bag of words), but adapt it to BoC (bag of characters): use a tokenizer that doesn't remove any characters, and build a dictionary of character n-grams of length 1 to 4.
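As a rough sketch of what that dictionary could look like in plain Python (the helper names `char_ngrams` and `build_vocabulary` are made up for illustration, not from any library):

```python
def char_ngrams(word, n_min=1, n_max=4):
    """Return every character n-gram of length n_min..n_max.

    No character is dropped, so digits, '/', '-', '.' and '@' all survive.
    """
    return [
        word[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(word) - n + 1)
    ]


def build_vocabulary(words, n_min=1, n_max=4):
    """Map each distinct n-gram seen in `words` to a column index."""
    grams = {g for w in words for g in char_ngrams(w, n_min, n_max)}
    return {gram: idx for idx, gram in enumerate(sorted(grams))}


# Example strings taken from the question.
vocab = build_vocabulary(["inv123", "22/09/1994", "1234.56"])

print(char_ngrams("in12", 1, 2))
# ['i', 'n', '1', '2', 'in', 'n1', '12']
```

Keeping punctuation is the point here: separators such as `/` or `.` are probably among the most useful signals for telling dates and amounts apart.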
After that you can represent any word as a vector over that dictionary, where each entry is the n-gram's count, a binary present/absent flag, or its TF-IDF weight.
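If you use scikit-learn, its text vectorizers can produce all three representations from character n-grams. This is only a sketch, assuming scikit-learn is installed; the input words are examples from the question:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

words = ["inv123", "22/09/1994", "samplevendorname", "1234.56"]

# analyzer="char" splits on characters instead of words and keeps punctuation;
# lowercase=False preserves the original casing.
common = dict(analyzer="char", ngram_range=(1, 4), lowercase=False)

count_vec = CountVectorizer(**common)                # number of occurrences
binary_vec = CountVectorizer(binary=True, **common)  # present (1) / absent (0)
tfidf_vec = TfidfVectorizer(**common)                # TF-IDF weights

X_counts = count_vec.fit_transform(words)
X_binary = binary_vec.fit_transform(words)
X_tfidf = tfidf_vec.fit_transform(words)

# All three are sparse matrices with one row per word and one column per n-gram.
print(X_counts.shape, X_binary.shape, X_tfidf.shape)
```

Which of the three works best depends on your data, so it is worth trying each one against a held-out set.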
Then build your model and pass it the word vectors to learn from. You can also study how each n-gram is distributed across the labels and discard the ones that only add noise to the dataset.
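One possible way to wire this together, again assuming scikit-learn: the choice of logistic regression, the chi-squared filter and `k=30` are my own assumptions for illustration, and the training set below is just the handful of strings from the question, far too small for a real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data built from the examples in the question.
train_words = [
    "inv123", "in12", "123",
    "22/09/1994", "22-Mon-16",
    "samplevendorname",
    "1234.56",
]
train_labels = [
    "invoice-Number", "invoice-Number", "invoice-Number",
    "invoice-Date", "invoice-Date",
    "vendor-Name",
    "net-amount",
]

model = make_pipeline(
    # Character n-grams of length 1 to 4, no characters removed.
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4), lowercase=False),
    # Keep only the n-grams most associated with the labels (chi-squared score);
    # this is one way to drop n-grams that mostly add noise.
    SelectKBest(chi2, k=30),
    # Any classifier that accepts sparse input could be swapped in here.
    LogisticRegression(max_iter=1000),
)
model.fit(train_words, train_labels)

# Predict the class of previously unseen strings.
print(model.predict(["inv999", "01/01/2020"]))
```

With realistic amounts of data you would tune `k` and compare classifiers using cross-validation rather than trusting a single fit.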
I hope this helps as a starting point.