Search code examples
nlpprobabilitycontext-free-grammarhidden-markov-modelscyk

Extract probabilities and most likely parse tree from cyk


In order to understand cyk algorithm I've worked through example on : https://www.youtube.com/watch?v=VTH1k-xiswM&feature=youtu.be .

The result of which is :

enter image description here

How do I extract the probabilities associated with each parse and extract the most likely parse tree ?


Solution

  • These are two distinct problems for PCFG:

    • recognition: does the sentence belong to the language generated by the CFG? (output: yes or no)
    • parsing: what is the highest scoring tree for this sentence? (output: parse tree)

    The CKY algorithm video linked in the question only deals with the recognition problem. To perform the parsing problem simultaneously, we need to (i) maintain the score of each parsing item and (ii) keep track of the hierarchical relationships (e.g. if we use the rule S -> NP VP: we must keep track of which NP and which VP are used to predict S).

    Notations:

    • A parsing item [X, i, j]: s means that there is a node labelled X spanning token i (included) to j (excluded) with score s. The score is the log probability of the subtree rooted in X.
    • The sentence is a sequence of words w_1 w_2 ... w_n.

    At an abstract level, PCFG parsing can be formulated as a set of deduction rules:

    1. Scan Rule (read tokens)

      ____________________________{precondition: X -> w_i is a grammar rule
      [X, i, i+1]: log p(X -> w_i)
      

      Gloss: if there is a rule X -> w_i in the grammar, then we can add the item [X, i, i+1] in the chart.

    2. Complete Rule (create a new constituent bottom up)

      [X, i, k]: s1     [Y, k, j]: s2
      _____________________________________{precondition: Z -> X Y is a grammar rule
      [Z, i, j]: s1 + s2 + log p(Z -> X, Y)
      

      Gloss: if there are 2 parsing items [X, i, k] and [Y, k, j] in the chart, and a rule Z -> X Y in the grammar, then we can add [Z, i, j] to the chart.

    The goal of weighted parsing is to deduce the parsing item [S, 0, n]:s (S: axiom, n: length of sentence) with the highest score s.

    Pseudo code for whole algorithm

    # The chart stores parsing items and their scores
    chart[beginning(int)][end(int)][NonTerminal][score(float)]
    
    # the backtrack table is used to recover the parse tree at the end
    backtrack[beginning][end][NonTerminal][item_left, item_right]
    
    # insert a new item in the chart
    # for a given span (i, j) and nonterminal X, we only need to
    # keep the single best scoring item.
    def insert(X, i, j, score, Y, Z, k):
        if X not in chart[i][j] or chart[i][j][X] < score             
            chart[i][j][X] <- score
            backtrack[i][j][X] <- (Y, i, k), (Z, k, j)
    
    n <- length of sentence
    
    for i in range(0, n):
        # apply scan rule
        insert(X, i, i+1, log p(X -> w_i)) for each grammar rule X -> w_i
    
    for span_length in range(2, n):
        for beginning in range(0, n - span_length):
            end <- beginning + span_length
    
            for k in range(beginning+1, end -1):
                # apply completion rules
                for each grammar rule X -> Y Z such that 
                    * Y is in chart[beginning][k]
                    * Z is in chart[k][end]
    
                    score_left <- chart[beginning][k][Y]
                    score_right <- chart[k][end][Z]
    
                    insert(X, beginning, end, log p(X -> Y Z) + score_left + score_right)
    
    if there is S (axiom) in chart[0][n], then parsing is successful
        the score (log probability) of the best tree is chart[0][n][S]
        the best tree can be recovered recursively by following pointers from backtrack[0][n][S]
    else:
        parsing failure, the sentence does not belong to the language
        generated by the grammar
    

    Some notes:

    • Time complexity is O(n^3 ⋅ |G|) where |G| is the size of the grammar.
    • The algorithm assumes that the grammar is in Chomsky Normal Form