Tags: python-3.x, validation, token, link-grammar

How to find invalid Link Grammar tokens?


I'd like to use the Link Grammar Python 3 bindings for a simple grammar checker. While the linkage API is relatively well documented, there doesn't seem to be a way to access the tokens that prevent linkages.

This is what I have so far:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from linkgrammar import Sentence, ParseOptions, Dictionary, __version__
print('Link Grammar Version:', __version__)

for sentence in ['This is a valid sample sentence.', 'I Can Has Cheezburger?']:
    sent = Sentence(sentence, Dictionary(), ParseOptions())
    linkages = sent.parse()
    if len(linkages) > 0:
        print('Valid:', sentence)
    else:
        print('Invalid:', sentence)

(I used link-grammar-5.4.3 for my tests.)

When I analyzed the invalid sample sentence using the Link Parser command line tool, I got the following output:

linkparser> I Can Has Cheezburger?
No complete linkages found.
Found 1 linkage (1 had no P.P. violations) at null count 1
    Unique linkage, cost vector = (UNUSED=1 DIS= 0.10 LEN=7)

    +------------------Xp------------------+
    +------------->Wa--------------+       |
    |            +---G--+-----G----+       |
    |            |      |          |       |
LEFT-WALL [I] Can[!] Has[!] Cheezburger[!] ?

How do I get all potentially invalid tokens, i.e. those marked with [!] or [?], in Python 3?


Solution

  • See how it is done in bindings/python-examples/sentence-check.py. Look at the latest version in the repository rather than the 5.4.3 copy, because this demo program had a bug in 5.4.3.

    Specifically, the following extracts the word list:

    words = list(linkage.words())
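
To get a `linkage` at all for a sentence like the second one in the question, the parser has to be allowed to leave words unlinked. Below is a minimal sketch, assuming the bindings accept `min_null_count` and `max_null_count` as `ParseOptions` keyword arguments the way the demo program does (please verify against your installed version). `first_linkage_words` is a hypothetical helper name, not part of the bindings.

```python
def first_linkage_words(text, lang="en"):
    """Return the word list of the first linkage found for `text`.

    Hypothetical helper; the linkgrammar import is done lazily so that
    this function can be defined even where the bindings are missing.
    """
    from linkgrammar import Sentence, ParseOptions, Dictionary

    # Assumption: allowing a non-zero null count lets the parser return
    # linkages in which some words are left unlinked ("null-linked").
    options = ParseOptions(min_null_count=0, max_null_count=999)
    sentence = Sentence(text, Dictionary(lang), options)

    # Take the first linkage only; return an empty list if there is none.
    for linkage in sentence.parse():
        return list(linkage.words())
    return []
```

With the bindings installed, `first_linkage_words('I Can Has Cheezburger?')` should return a list comparable to the words in the link-parser diagram from the question.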
    

    Unlinked words are wrapped in []. Words with a bracketed marker appended, such as [!] or [?], are guessed ones. For example, [!] means that the word was classified by a regex (defined in the file 4.0.regex) and that this classification was then looked up in the dictionary. If you set the parse option display_morphology to True, the name of the classifying regex appears after the !.

    Here is the full legend of the word output format:

     [word]            Null-linked word
     word[!]           word classified by a regex
     word[!REGEX_NAME] word classified by REGEX_NAME (turn on by morphology=1)
     word[~]           word generated by a spell guess (unknown original word)
     word[&]           word run-on separated by a spell guess
     word[?]           word is unknown (looked up in the dict as UNKNOWN-WORD)
     word.POS          word found in the dictionary as word.POS
     word.#CORRECTION  word is probably a typo - got linked as CORRECTION
    
    For dictionaries that support morphology (turn on by morphology=1):
     word=             A prefix morpheme
     =word             A suffix morpheme
     word.=            A stem
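
The legend above can be turned into a small classifier for the strings returned by `linkage.words()`. This is only a sketch: `classify_token` and `GUESS_MARKERS` are hypothetical names, the sample word list is copied by hand from the link-parser diagram in the question, and the regular expressions cover just the markers listed in the legend.

```python
import re

# Suffix markers from the legend, mapped to short descriptions.
GUESS_MARKERS = {
    "!": "classified by a regex",
    "~": "spell guess",
    "&": "run-on separated by a spell guess",
    "?": "unknown word",
}

# [word]: a null-linked word.
_NULL_RE = re.compile(r"^\[(?P<word>.+)\]$")
# word[X] or word[!REGEX_NAME]: a guessed word.
_GUESS_RE = re.compile(r"^(?P<word>.+?)\[(?P<mark>[!~&?])(?P<name>[^\]]*)\]$")


def classify_token(token):
    """Return (word, reason) for a suspicious token, or None otherwise."""
    match = _NULL_RE.match(token)
    if match:
        return match.group("word"), "null-linked"
    match = _GUESS_RE.match(token)
    if match:
        return match.group("word"), GUESS_MARKERS[match.group("mark")]
    # word.#CORRECTION: probable typo that got linked as CORRECTION.
    word, sep, correction = token.partition(".#")
    if sep and word and correction:
        return word, "probable typo, linked as " + correction
    return None


# Words copied by hand from the diagram in the question (an assumption
# about what linkage.words() yields for that sentence).
words = ["LEFT-WALL", "[I]", "Can[!]", "Has[!]", "Cheezburger[!]", "?"]
flagged = [result for result in map(classify_token, words) if result]
for word, reason in flagged:
    print(word, "->", reason)
```

The null-linked check runs first so that a bracketed punctuation token such as `[?]` is reported as null-linked and not mistaken for a guess marker.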
    

    It may be useful to match the output words to the original sentence words, especially in the case of spell corrections or when morphology is turned on. The demo program sentence-check.py does that when you call it with -p; see the code under if arg.position:.

    In the case of your demo sentence I Can Has Cheezburger?, only the word I has no linkage; the other words were classified as capitalized-words and linked as proper nouns (the G link type).

    You can find more information on the link types in summarize-links.