I know this is more of a grammar question, but how do you determine the "subject" of a sentence if you have an array of Penn Treebank
tokens like:
[WP][VBZ][DT][NN]
Is there a Java library that can take such tokens and determine which one (or which ones) is the subject?
I have been successfully classifying subjects for Portuguese using OpenNLP. I built a shallow parser by slightly tweaking the OpenNLP Chunker component.
You can use the existing OpenNLP models for POS tagging and chunking, but you will need to train a new chunk model that takes the POS tags + chunk tags as input and classifies subjects.
The data format for training the Chunker is based on CoNLL-2000:
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
...
I then created a new corpus that looks like the following:
He PRP+B-NP B-SUBJ
reckons VBZ+B-VP B-V
the DT+B-NP O
current JJ+I-NP O
account NN+I-NP O
deficit NN+I-NP O
will MD+B-VP O
narrow VB+I-VP O
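To make the conversion concrete, here is a minimal Java sketch that turns CoNLL-2000 style lines into the merged format shown above. It is not the tooling I actually used: the class and method names are hypothetical, and it assumes you already have one subject label per token (B-SUBJ, B-V, O, ...) from somewhere else, for example from subject nodes in a treebank.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only: merges the POS and chunk columns
// of a CoNLL-2000 line into one tag and appends a subject label.
public class SubjectCorpusSketch {

    // "He PRP B-NP" + "B-SUBJ" becomes "He PRP+B-NP B-SUBJ"
    public static String convertLine(String conllLine, String subjectLabel) {
        String[] cols = conllLine.trim().split("\\s+");
        if (cols.length != 3) {
            throw new IllegalArgumentException(
                "Expected 'token POS chunk', got: " + conllLine);
        }
        return cols[0] + " " + cols[1] + "+" + cols[2] + " " + subjectLabel;
    }

    // Converts a whole sentence; labels must be aligned with the lines one-to-one.
    public static List<String> convertSentence(List<String> conllLines,
                                               List<String> subjectLabels) {
        if (conllLines.size() != subjectLabels.size()) {
            throw new IllegalArgumentException("Lines and labels differ in length");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < conllLines.size(); i++) {
            out.add(convertLine(conllLines.get(i), subjectLabels.get(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList(
            "He PRP B-NP", "reckons VBZ B-VP", "the DT B-NP");
        // Assumed to come from an external annotation step.
        List<String> labels = Arrays.asList("B-SUBJ", "B-V", "O");
        for (String line : convertSentence(sentence, labels)) {
            System.out.println(line);
        }
    }
}
```

The merged tag (for example PRP+B-NP) sits in the column where the Chunker normally expects the POS tag, so the training data keeps the usual three-column layout.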
If you have access to the Penn Treebank, you can create such data by looking for subject nodes in the corpus. Maybe you can start with this Perl script used to generate the data for the CoNLL-2000 Shared Task.
The evaluation results for Portuguese are 87.07% precision, 75.48% recall, and 80.86% F1.
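As a quick sanity check, F1 is the harmonic mean of precision and recall, and the figures above are consistent with that. A small sketch (the class name is just for illustration):

```java
import java.util.Locale;

// Illustrative check that the reported F1 follows from precision and recall.
public class F1Check {

    // F1 = 2 * P * R / (P + R), the harmonic mean of precision and recall.
    public static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Precision and recall in percent, as reported for Portuguese.
        System.out.println(String.format(Locale.ROOT, "%.2f", f1(87.07, 75.48)));
        // prints 80.86
    }
}
```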