SENNA is an NLP tool built using neural nets, and it's able to do POS tagging, chunking, NER, semantic role labeling (SRL) and syntactic parsing.
After downloading the pre-compiled package from http://ml.nec-labs.com/senna/download.html, I ran the --help menu to see what the options are:
alvas@ubi:~/senna$ ./senna-linux64 --help
invalid argument: --help
SENNA Tagger (POS - CHK - NER - SRL)
(c) Ronan Collobert 2009
Usage: ./senna-linux64 [options]
Takes sentence (one line per sentence) on stdin
Outputs tags on stdout
Typical usage: ./senna-linux64 [options] < inputfile.txt > outputfile.txt
Display options:
-h Display this help
-verbose Display model informations on stderr
-notokentags Do not output tokens
-offsettags Output start/end offset of each token
-iobtags Output IOB tags instead of IOBES
-brackettags Output 'bracket' tags instead of IOBES
Data options:
-path <path> Path to the SENNA data/ and hash/ directories [default: ./]
Input options:
-usrtokens Use user's tokens (space separated) instead of SENNA tokenizer
SRL options:
-posvbs Use POS verbs instead of SRL style verbs for SRL task
-usrvbs <file> Use user's verbs (given in <file>) instead of SENNA verbs for SRL task
Tagging options:
-pos Output POS
-chk Output CHK
-ner Output NER
-srl Output SRL
-psg Output PSG
The command-line interface is straightforward, and the outputs for POS and NER tags are also easy to interpret.
Given this input:
alvas@ubi:~/senna$ cat test.in
Foo went to eat bar at the Foobar.
The POS output uses our standard Penn Treebank tagset:
alvas@ubi:~/senna$ ./senna-linux64 -pos < test.in
Foo NNP
went VBD
to TO
eat VB
bar NN
at IN
the DT
Foobar NNP
. .
And the NER output uses a BIO-like tagset:
alvas@ubi:~/senna$ ./senna-linux64 -ner < test.in
Foo S-PER
went O
to O
eat O
bar O
at O
the O
Foobar S-LOC
. O
And for the chunking it's also some sort of the BIOE tagset we're used to:
alvas@ubi:~/senna$ ./senna-linux64 -chk < test.in
Foo S-NP
went B-VP
to I-VP
eat E-VP
bar S-NP
at S-PP
the B-NP
Foobar E-NP
. O
But what do the S- tags mean? It seems they're only attached to tokens that are single-token chunks; is that true?
The SRL tags are a little weird; there are multiple annotations per token:
alvas@ubi:~/senna$ ./senna-linux64 -srl < test.in
Foo - S-A1 S-A0
went went S-V O
to - B-AM-PNC O
eat eat I-AM-PNC S-V
bar - I-AM-PNC S-A1
at - I-AM-PNC B-AM-LOC
the - I-AM-PNC I-AM-LOC
Foobar - E-AM-PNC E-AM-LOC
. - O O
They look like the "tuple-like" outputs we get from semantic frames, but I don't understand the conventions, e.g. what is -AM-? What is -PNC?
What does the output mean and how should we interpret it?
And for the Parser output:
alvas@ubi:~/senna$ ./senna-linux64 -psg < test.in
Foo (S1(S(NP*)
went (VP*
to (S(VP*
eat (VP*
bar (ADVP*)
at (PP*
the (NP*
Foobar *))))))
. *))
It looks like the bracketed parse output we see in parsing, but what does the * mean?
SENNA uses a CoNLL-style column format: one token per line, with one column per annotation. You can read about a related CoNLL format here: http://universaldependencies.github.io/docs/format.html
It's rather common and there are plenty of converters around.
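As for the prefixes, this is the IOBES scheme: S- marks a single-token (singleton) expression, while B-, I- and E- mark the beginning, inside and end of a multi-word expression. O means the token is outside any expression. So yes, S- is only attached to single-token chunks. If you prefer plain IOB tags, the -iobtags option in the help menu switches the output.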
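To make the prefixes concrete, here is a minimal Python sketch that groups IOBES-tagged tokens back into chunks. It assumes two whitespace-separated columns (token, tag) exactly as in the -chk output from the question; iobes_to_spans is a hypothetical helper name, not something shipped with SENNA.

```python
def iobes_to_spans(rows):
    """Group (token, tag) pairs tagged in IOBES into (label, tokens) spans."""
    spans, current, label = [], [], None
    for token, tag in rows:
        if tag == "O":
            # Outside any expression: drop whatever was open.
            current, label = [], None
            continue
        prefix, _, name = tag.partition("-")  # "B-VP" -> ("B", "-", "VP")
        if prefix == "S":
            # Singleton: a complete one-token span.
            spans.append((name, [token]))
        elif prefix == "B":
            # Begin: open a new multi-token span.
            current, label = [token], name
        elif prefix in ("I", "E"):
            current.append(token)
            if prefix == "E":
                # End: close the span that was opened by B-.
                spans.append((label or name, current))
                current, label = [], None
    return spans


# The -chk output from the question, pasted verbatim.
chk_output = """\
Foo S-NP
went B-VP
to I-VP
eat E-VP
bar S-NP
at S-PP
the B-NP
Foobar E-NP
. O
"""

rows = [line.split() for line in chk_output.splitlines() if line.strip()]
chunks = iobes_to_spans(rows)
for chunk_label, chunk_tokens in chunks:
    print(chunk_label, " ".join(chunk_tokens))
```

The same function should work on the -ner column, since it uses the same prefixes with different labels (PER, LOC, ...).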
Then there is the output of the semantic role labeling. Look for more information on SRL, as this gets a little more complex. The second column marks the predicates (verbs), and after it there is one tag column per predicate: here, one for the verb go (went) and one for the verb eat. Usually A0 is the subject and A1 the direct object (again, oversimplified; the exact roles depend on the verb's frameset). AM marks a modifier (adjunct) argument rather than a core argument, and the suffix gives its type: -LOC is a location (there are others for time, manner and so on). As far as I remember, PNC stands for "purpose, not cause", so "to eat bar at the Foobar" is tagged as the purpose of going. Examples here: verbs.colorado.edu/propbank/framesets-english/go-v.html

As for the parse tree, it's bracketed and also a common notation, loosely inspired by Lisp. The * marks the position of the current token inside the brackets: put each token in place of its * and concatenate the column to get the usual bracketed tree. I found this useful: https://math.stackexchange.com/questions/588230/how-to-convert-parentheses-notation-for-trees-into-an-actual-tree-drawing
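Both readings can be sketched in a few lines of Python. This is only an illustration under assumptions: the column layout is taken from the -srl and -psg outputs in the question (token, predicate column, then one tag column per predicate; token, then bracket column), and srl_frames and psg_to_tree are hypothetical helper names, not part of SENNA.

```python
# The -srl and -psg outputs from the question, pasted verbatim.
srl_output = """\
Foo - S-A1 S-A0
went went S-V O
to - B-AM-PNC O
eat eat I-AM-PNC S-V
bar - I-AM-PNC S-A1
at - I-AM-PNC B-AM-LOC
the - I-AM-PNC I-AM-LOC
Foobar - E-AM-PNC E-AM-LOC
. - O O
"""

psg_output = """\
Foo (S1(S(NP*)
went (VP*
to (S(VP*
eat (VP*
bar (ADVP*)
at (PP*
the (NP*
Foobar *))))))
. *))
"""


def srl_frames(text):
    """Return one (predicate, {role: words}) pair per predicate column."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # Column 1 holds the predicate on its own row and "-" elsewhere.
    predicates = [r[1] for r in rows if r[1] != "-"]
    frames = []
    for k, verb in enumerate(predicates):
        roles = {}
        for r in rows:
            tag = r[2 + k]  # k-th tag column belongs to the k-th predicate
            if tag != "O":
                role = tag.split("-", 1)[1]  # strip the S-/B-/I-/E- prefix
                # Simplification: spans sharing a role label are merged.
                roles.setdefault(role, []).append(r[0])
        frames.append((verb, {role: " ".join(t) for role, t in roles.items()}))
    return frames


def psg_to_tree(text):
    """Substitute each token for its '*' and join the bracket column."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    return "".join(col.replace("*", " " + tok) for tok, col in rows)


frames = srl_frames(srl_output)
for verb, roles in frames:
    print(verb, roles)

tree = psg_to_tree(psg_output)
print(tree)
```

Read this way, the first frame says that for went, Foo is A1 and "to eat bar at the Foobar" is the AM-PNC modifier; the second says that for eat, Foo is A0, bar is A1 and "at the Foobar" is AM-LOC. Note that the reconstructed tree has bare tokens as leaves, because the -psg column alone carries no POS tags.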