python-3.x nlp spacy dependency-parsing

NLP: Compare these two sentences. Is this a misclassification?


I am using spaCy's dependency parser, and I am puzzled by how it parses these two very similar sentences.

Sentence 1:

text='He noted his father was a nice guy.'

Note that in this sentence "father" is clearly labeled as the subject (nsubj) of "was" in the clause "his father was a nice guy":

[(0, 'He', '-PRON-', 'PRON', 'PRP', 'nsubj'), (1, 'noted', 'note', 'VERB', 'VBD', 'ROOT'), (2, 'his', '-PRON-', 'DET', 'PRP$', 'poss'), (3, 'father', 'father', 'NOUN', 'NN', 'nsubj'), (4, 'was', 'be', 'VERB', 'VBD', 'ccomp'), (5, 'a', 'a', 'DET', 'DT', 'det'), (6, 'nice', 'nice', 'ADJ', 'JJ', 'amod'), (7, 'guy', 'guy', 'NOUN', 'NN', 'attr'), (8, '.', '.', 'PUNCT', '.', 'punct')]

        noted              
  ________|_____            
 |   |         was         
 |   |     _____|___        
 |   |  father     guy     
 |   |    |      ___|___    
 He  .   his    a      nice

the_verb=spacy_doc[4]  # the token "was"

for child in the_verb.children:
    print(child,child.dep_)
    
>> father nsubj
>> guy attr

for ancestor in the_verb.ancestors:
    print(ancestor,ancestor.dep_)
    
>> noted ROOT

Sentence 2:

text='He noted his father, as \"a man with different attributes\", was a nice guy.'

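This is a minor variation of the previous sentence, but "father" is no longer labeled as the subject. It is now the direct object (dobj) of "noted", and "was" is attached to "noted" as conj: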

[(0, 'He', '-PRON-', 'PRON', 'PRP', 'nsubj'), (1, 'noted', 'note', 'VERB', 'VBD', 'ROOT'), (2, 'his', '-PRON-', 'DET', 'PRP$', 'poss'), (3, 'father', 'father', 'NOUN', 'NN', 'dobj'), (4, ',', ',', 'PUNCT', ',', 'punct'), (5, 'as', 'as', 'ADP', 'IN', 'prep'), (6, '"', '"', 'PUNCT', '``', 'punct'), (7, 'a', 'a', 'DET', 'DT', 'det'), (8, 'man', 'man', 'NOUN', 'NN', 'pobj'), (9, 'with', 'with', 'ADP', 'IN', 'prep'), (10, 'different', 'different', 'ADJ', 'JJ', 'amod'), (11, 'attributes', 'attribute', 'NOUN', 'NNS', 'pobj'), (12, '"', '"', 'PUNCT', "''", 'punct'), (13, ',', ',', 'PUNCT', ',', 'punct'), (14, 'was', 'be', 'VERB', 'VBD', 'conj'), (15, 'a', 'a', 'DET', 'DT', 'det'), (16, 'nice', 'nice', 'ADJ', 'JJ', 'amod'), (17, 'guy', 'guy', 'NOUN', 'NN', 'attr'), (18, '.', '.', 'PUNCT', '.', 'punct')]

                noted                                 
  ________________|____________________________        
 |   |   |   |    |         as                 |      
 |   |   |   |    |         |                  |       
 |   |   |   |    |        man                 |      
 |   |   |   |    |      ___|______            |       
 |   |   |   |    |     |   |     with        was     
 |   |   |   |    |     |   |      |           |       
 |   |   |   |  father  |   a  attributes     guy     
 |   |   |   |    |     |   |      |        ___|___    
 He  ,   ,   .   his    "   "  different   a      nice


the_verb=spacy_doc[14]

for child in the_verb.children:
    print(child,child.dep_)
    
>> guy attr

for ancestor in the_verb.ancestors:
    print(ancestor,ancestor.dep_)
    
>> noted ROOT

I am trying to understand how spaCy parses these sentences. Is the second case a misclassification? Shouldn't "father" still be the subject of "was"?


Solution

  • I wonder if you are thinking of a phrase-structure (constituency) parse tree instead of a dependency tree...

    I've always found dependency trees confusing, to be honest. They are good at capturing relations between individual words (which word depends on which), but I don't think they are as good at describing the overall phrase structure of a sentence. Phrase-structure rules are better, although still imperfect, at identifying noun phrases, verb phrases, and their constituents. A dependency parser can be used to detect noun chunks and prepositional phrases, and to infer verb phrases, but I don't think that is its main function. It is the main function of a constituency parse tree, though.

    To return to your question:

    The way you're talking about "father" being the subject sounds like you're trying to understand the deep syntactic structure (absolute) but using a relative model (dependency parser).

    In essence, I believe the inserted phrase ', as "a man with different attributes",' adds layers to the dependency tree. These layers separate the actual subject "his father" from the verb phrase "was a nice guy". I'd imagine it adds one layer for the commas, another for the quotes, and another for the as-phrase, until eventually the subject and its verb end up "too far" apart for the parser to connect them.

    The syntactic analysis can only be as good as the models that generate it. In fact, you'll see that spaCy exposes two separate syntactic annotations for each token: the dependency label predicted by the dependency parser (available under token.dep_) and the part-of-speech tag predicted by a separate tagging component (available under token.pos_). Both come from statistical models, so they are not always consistent with each other, or with what you would expect.

    Out of interest, I believe NLTK has a more traditional phrase-structure-rules-based parse tree available; although even these have limitations. If you want deep, hard-core syntactic analyses of real-life sentences, you may want to try something like Head-driven phrase structure grammar (HPSG) but you'll see that things start to get just a little bit technical. :)
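
To make the comparison between the two parses concrete, here is a minimal sketch in plain Python (no spaCy needed). The text, POS, and dependency labels are copied from the output in the question. The head indices are an assumption read off the printed trees, and the punctuation and the inside of the as-phrase are left out of sentence 2 because their attachment is not needed here. `Tok` and `subjects` are hypothetical helper names. In spaCy itself the same information would come from token.pos_, token.dep_, and token.head.

```python
from collections import namedtuple

# One row per token: index, text, POS (token.pos_), dependency label
# (token.dep_), and the index of its head (token.head.i).
# Head indices are an assumption taken from the trees in the question;
# the ROOT token points at itself.
Tok = namedtuple("Tok", "i text pos dep head")

SENT1 = [
    Tok(0, "He", "PRON", "nsubj", 1),
    Tok(1, "noted", "VERB", "ROOT", 1),
    Tok(2, "his", "DET", "poss", 3),
    Tok(3, "father", "NOUN", "nsubj", 4),
    Tok(4, "was", "VERB", "ccomp", 1),
    Tok(5, "a", "DET", "det", 7),
    Tok(6, "nice", "ADJ", "amod", 7),
    Tok(7, "guy", "NOUN", "attr", 4),
    Tok(8, ".", "PUNCT", "punct", 1),
]

# Sentence 2, keeping the original token indices but omitting punctuation
# and the tokens inside the quoted phrase.
SENT2 = [
    Tok(0, "He", "PRON", "nsubj", 1),
    Tok(1, "noted", "VERB", "ROOT", 1),
    Tok(2, "his", "DET", "poss", 3),
    Tok(3, "father", "NOUN", "dobj", 1),
    Tok(5, "as", "ADP", "prep", 1),
    Tok(8, "man", "NOUN", "pobj", 5),
    Tok(14, "was", "VERB", "conj", 1),
    Tok(15, "a", "DET", "det", 17),
    Tok(16, "nice", "ADJ", "amod", 17),
    Tok(17, "guy", "NOUN", "attr", 14),
]


def subjects(sent, verb_index):
    """Return the texts of tokens attached to the verb with the label nsubj."""
    return [t.text for t in sent
            if t.head == verb_index and t.i != verb_index and t.dep == "nsubj"]


for name, sent, verb_index in (("sentence 1", SENT1, 4), ("sentence 2", SENT2, 14)):
    father = next(t for t in sent if t.text == "father")
    head = next(t for t in sent if t.i == father.head)
    print(name, "| subjects of 'was':", subjects(sent, verb_index),
          "| father:", father.pos, father.dep, "<-", head.text)
```

In both sentences "father" keeps the same part of speech (NOUN). Only its dependency label and head change: nsubj of "was" in sentence 1, dobj of "noted" in sentence 2, which leaves "was" without any nsubj child. That is the sense in which the second parse has lost the subject relation you were expecting.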