I'm looking for the data structure, enum, or generative process through which the different parts of speech are represented internally. I've spent a long time scanning the Javadoc and the source code and can't find it. If the tags are stored in some central location, I would like to access a collection of them directly. Please forgive me if the question rests on a naive assumption about how CoreNLP POS tagging operates, but if what I'm describing does exist in some form, it would be very helpful. Thanks!
I'm not actually sure they're represented explicitly anywhere in the code. The tagger simply outputs them as Strings rather than as values of a fixed enum, and the output space is inferred directly from the training data. The advantage is that you can train the exact same model on arbitrary tag sets. The disadvantage, of course, is the one you've just run into. :)
However, for English, the tag set should be the Penn Treebank tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
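Since the tags only exist as Strings inferred from data, one workaround is to build the collection yourself from tagged text. The following is a minimal sketch, not CoreNLP's own API: the class `TagSetSketch` and the method `collectTags` are hypothetical names, and it assumes tokens are written as `word_TAG` with an underscore separator, so adjust the separator if your data uses something else.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of CoreNLP.
class TagSetSketch {

    // Collects the distinct tags from sentences whose tokens look like word_TAG.
    // Assumption: the tag follows the LAST underscore in each token.
    static Set<String> collectTags(List<String> taggedSentences) {
        Set<String> tags = new TreeSet<>(); // sorted, duplicates dropped
        for (String sentence : taggedSentences) {
            for (String token : sentence.trim().split("\\s+")) {
                int sep = token.lastIndexOf('_');
                // Skip tokens with no separator or with nothing after it.
                if (sep < 0 || sep == token.length() - 1) {
                    continue;
                }
                tags.add(token.substring(sep + 1));
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Made-up sample sentences, for illustration only.
        List<String> sample = List.of(
                "The_DT dog_NN barks_VBZ ._.",
                "Dogs_NNS bark_VBP loudly_RB ._.");
        System.out.println(collectTags(sample));
    }
}
```

Run over a model's training data, this would give the tag set that model was trained on. Run over tagger output, it only gives the tags that happened to appear, so rare tags may be missing.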