I am looking at Stanford NER and want to know how the words are represented. Are they converted to vectors using Word2Vec or GloVe when training the model with a linear-chain CRF?
A little more study shows me that the data is stored in a CRFDatum structure. Can anyone elaborate on this?
Well, now I know how the old-school AI people feel...
Back in the Old Days (including when the NER system was built), before neural networks took off, statistical ML converted discrete inputs into vectors using custom-built featurizers. For language, this usually resulted in a very long but sparse vector of one-hot features. For example, a featurizer might assign each word a one-hot representation: 1 at the index corresponding to the word, and 0 everywhere else. For NER, these features were usually things like the characters in the word (one-hot encoded), prefixes and suffixes of length $k$, word shape, part-of-speech tag, etc.
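To make that concrete, here is a rough standalone sketch (plain Java, not Stanford's actual featurizer, and the class/feature names are made up for illustration) of how a featurizer turns a word into a handful of sparse indicator features; each feature name corresponds to one dimension of a huge, mostly-zero vector that is set to 1 for this word:

import java.util.ArrayList;
import java.util.List;

// Toy featurizer: turns a single word into a list of sparse indicator features.
public class ToyFeaturizer {
    public static List<String> featurize(String word) {
        List<String> feats = new ArrayList<>();
        feats.add("WORD=" + word);                               // the word itself, one-hot
        int k = Math.min(3, word.length());
        feats.add("PREFIX=" + word.substring(0, k));             // prefix of length k
        feats.add("SUFFIX=" + word.substring(word.length() - k)); // suffix of length k
        feats.add("SHAPE=" + shape(word));                       // word shape, e.g. "Xxxx"
        return feats;
    }

    // Collapse characters into a coarse shape: uppercase -> X, lowercase -> x, digit -> d.
    private static String shape(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (Character.isUpperCase(c)) sb.append('X');
            else if (Character.isLowerCase(c)) sb.append('x');
            else if (Character.isDigit(c)) sb.append('d');
            else sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(featurize("Stanford"));
        // [WORD=Stanford, PREFIX=Sta, SUFFIX=ord, SHAPE=Xxxxxxxx]
    }
}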
In Stanford's code, these sparse vectors are usually represented as Counter objects of one form or another, which then get passed into a Datum object and converted into a more densely packed Dataset object, which is fed into the optimizer (usually QNMinimizer, implementing L-BFGS).
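If it helps, here is a rough illustration of the idea behind that Counter -> Datum -> Dataset step, again in plain Java rather than Stanford's actual classes (the names ToyDataset, add, numFeatures are invented for this sketch): string-keyed feature counts get mapped to integer indices so the whole training set can be stored compactly as int arrays before being handed to the optimizer.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for the "densely packed" dataset: each training example's
// sparse feature counter is re-expressed as an array of integer feature ids.
public class ToyDataset {
    private final Map<String, Integer> featureIndex = new HashMap<>();
    private final List<int[]> data = new ArrayList<>();    // one int[] of feature ids per datum
    private final List<String> labels = new ArrayList<>(); // gold label per datum

    // Add one training example: a sparse "counter" of features plus its label.
    public void add(Map<String, Double> featureCounts, String label) {
        int[] ids = new int[featureCounts.size()];
        int i = 0;
        for (String feat : featureCounts.keySet()) {
            // Assign the next free index the first time a feature name is seen.
            ids[i++] = featureIndex.computeIfAbsent(feat, f -> featureIndex.size());
        }
        data.add(ids);
        labels.add(label);
    }

    public int numFeatures() { return featureIndex.size(); }

    public static void main(String[] args) {
        ToyDataset dataset = new ToyDataset();
        Map<String, Double> counts = new HashMap<>();
        counts.put("WORD=Stanford", 1.0);
        counts.put("SHAPE=Xxxxxxxx", 1.0);
        dataset.add(counts, "ORGANIZATION");
        System.out.println("features indexed: " + dataset.numFeatures()); // 2
    }
}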