Finding the "subject" from an array of part-of-speech tags

I know this is more of a grammar question, but how do you determine the "subject" of a sentence if you have an array of Penn Treebank tags like:

[WP][VBZ][DT][NN]

Is there any Java library that can take in such tags and determine which one is the subject? Or which ones?

Solution

I have been successfully classifying subjects for Portuguese using OpenNLP. I built a shallow parser by slightly modifying the OpenNLP Chunker component.

You can use the existing OpenNLP models for POS tagging and chunking, but you will need to train a new chunker model that takes the POS tags plus the chunk tags as input and classifies subjects.
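
To make the second stage concrete, here is a minimal Java sketch of how its input could be assembled. It assumes the stock POS tagger and chunker have already produced one tag per token, and it joins the two with "+" to match the training format shown later in this answer. The class and method names are made up for illustration; the merged array would presumably be handed to the subject chunker in the slot where it normally expects POS tags.

```java
import java.util.Arrays;

// Sketch only: builds the tag sequence for the second (subject) chunker
// by joining each POS tag with its chunk tag, e.g. "PRP" + "B-NP" -> "PRP+B-NP".
// CombinedTags and merge are hypothetical names, not part of OpenNLP.
class CombinedTags {

    // Joins the two tag arrays position by position with a "+" separator.
    static String[] merge(String[] posTags, String[] chunkTags) {
        if (posTags.length != chunkTags.length) {
            throw new IllegalArgumentException(
                "POS and chunk tag arrays must have the same length");
        }
        String[] merged = new String[posTags.length];
        for (int i = 0; i < posTags.length; i++) {
            merged[i] = posTags[i] + "+" + chunkTags[i];
        }
        return merged;
    }

    public static void main(String[] args) {
        // Tags as they would come out of the first-stage tagger and chunker.
        String[] pos = {"PRP", "VBZ", "DT", "JJ", "NN", "NN", "MD", "VB"};
        String[] chunks = {"B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "I-VP"};
        System.out.println(Arrays.toString(merge(pos, chunks)));
    }
}
```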

The data format for training the Chunker is based on CoNLL-2000:

He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
...

I then created a new corpus that looks like the following:

He        PRP+B-NP  B-SUBJ
reckons   VBZ+B-VP  B-V
the       DT+B-NP   O
current   JJ+I-NP   O
account   NN+I-NP   O
deficit   NN+I-NP   O
will      MD+B-VP   O
narrow    VB+I-VP   O
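
Once a model predicts labels in this scheme, answering "which tokens are the subject" is a scan over the label sequence. The sketch below is one way to do that in Java. Note that the example above only shows B-SUBJ; the I-SUBJ continuation label is my assumption, by analogy with I-NP, for subjects longer than one token. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: reads subject phrases back out of BIO-style labels.
class SubjectSpans {

    // "B-SUBJ" opens a subject; "I-SUBJ" (assumed label) continues the open one.
    // Any other label closes the current subject. A stray "I-SUBJ" with no
    // open subject is ignored.
    static List<String> extract(String[] tokens, String[] labels) {
        List<String> subjects = new ArrayList<>();
        StringBuilder current = null;
        for (int i = 0; i < tokens.length; i++) {
            if (labels[i].equals("B-SUBJ")) {
                if (current != null) {
                    subjects.add(current.toString());
                }
                current = new StringBuilder(tokens[i]);
            } else if (labels[i].equals("I-SUBJ") && current != null) {
                current.append(' ').append(tokens[i]);
            } else if (current != null) {
                subjects.add(current.toString());
                current = null;
            }
        }
        if (current != null) {
            subjects.add(current.toString());
        }
        return subjects;
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "reckons", "the", "current", "account", "deficit", "will", "narrow"};
        String[] labels = {"B-SUBJ", "B-V", "O", "O", "O", "O", "O", "O"};
        System.out.println(extract(tokens, labels)); // prints [He]
    }
}
```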

If you have access to the Penn Treebank, you can create such data by looking for subject nodes in the corpus. A possible starting point is this Perl script, which was used to generate the data for the CoNLL-2000 Shared Task.
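
As a rough illustration of the rewriting step, the Java sketch below turns one CoNLL-2000 style row into a row of the new corpus. It assumes the subject label for each token has already been derived from the treebank (for example from noun phrases carrying the -SBJ function tag); that derivation is not shown here, and the class and method names are hypothetical.

```java
// Sketch only: rewrites "word POS chunk" into "word POS+chunk label".
class SubjectCorpusLine {

    // conllRow: one whitespace-separated CoNLL-2000 row, e.g. "He PRP B-NP".
    // subjectLabel: the label for this token, e.g. "B-SUBJ", "B-V" or "O",
    // assumed to come from a separate pass over the treebank.
    static String convert(String conllRow, String subjectLabel) {
        String[] cols = conllRow.trim().split("\\s+");
        if (cols.length != 3) {
            throw new IllegalArgumentException(
                "Expected 3 columns (word, POS, chunk) but got: " + conllRow);
        }
        // The POS tag and chunk tag become a single combined feature column.
        return cols[0] + " " + cols[1] + "+" + cols[2] + " " + subjectLabel;
    }

    public static void main(String[] args) {
        System.out.println(convert("He        PRP  B-NP", "B-SUBJ")); // prints He PRP+B-NP B-SUBJ
        System.out.println(convert("reckons   VBZ  B-VP", "B-V"));    // prints reckons VBZ+B-VP B-V
    }
}
```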

The evaluation results for Portuguese are 87.07% precision, 75.48% recall, and 80.86% F1.
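
For reference, F1 is the harmonic mean of precision and recall, so the three figures can be checked against each other. A small Java sketch (class name is hypothetical):

```java
import java.util.Locale;

// Sketch only: recomputes F1 from the reported precision and recall.
class F1Check {

    // Harmonic mean of precision and recall: 2PR / (P + R).
    static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Values are percentages, as reported above.
        System.out.println(String.format(Locale.ROOT, "%.2f", f1(87.07, 75.48))); // prints 80.86
    }
}
```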