I'm working with a graph that I extracted from a long article. The directed weighted graph, which is really nothing more than a dictionary, contains head words as vertices, each connected by an edge to every word (tail word) that follows it in the article. So, if the word "yellow" appears 3 times in the article and is followed by the words "brick", "brick", and "submarine", then the "yellow" entry would be represented like so in the graph:
{"yellow": ["brick", "brick", "submarine"]}
This graph was generated using a Python class I wrote called ExtractedGraph that, aside from an __init__ method that does the work of building the graph, has a getProb(self, head_word, tail_word) method that takes the head word and the tail word and returns the probability that the tail word follows the head word, which is the weight of the edge connecting the head word node and the tail word node. So, if we input "yellow" and "brick", the output would be 2/3.
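To make the interface concrete, here is a stripped-down sketch of the class (the real __init__ builds the dictionary from the article text; this version just takes the dictionary as given):

class ExtractedGraph:
    def __init__(self, graph_nodes):
        # graph_nodes maps each head word to the list of words that follow it,
        # repetitions included, e.g. {"yellow": ["brick", "brick", "submarine"]}.
        self.graph_nodes = graph_nodes

    def getProb(self, head_word, tail_word):
        # Edge weight: the relative frequency of tail_word among the followers
        # of head_word, e.g. getProb("yellow", "brick") == 2/3.
        tails = self.graph_nodes.get(head_word, [])
        return tails.count(tail_word) / len(tails) if tails else 0.0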
My question is: how would one do a beam search of this graph to find the sentence with the maximum score? Specifically, what if the input to the beam search function were a prefix_words string, a beam_width int, and a sen_length int (the maximum word length of a sentence)? How would the algorithm look? After reading about the beam search algorithm online and watching numerous tutorials, I'm not sure how the beam search function would really work in this specific scenario.
Let's say graph_nodes is the dictionary, every sentence must start with the <s> symbol with probability 1.0, and all sentences must end with a special symbol </s>. To avoid re-sorting the hypotheses, I keep them in a heap, so keeping only the best beam_width of them is cheap (each push or pop costs O(log beam_width)).
import heapq

beam = [(1.0, ["<s>"])]
for _ in range(sen_length):
    new_beam = []
    for score, hypothesis in beam:
        hypothesis_end = hypothesis[-1]
        # A finished hypothesis returns to the beam and competes with the new ones.
        if hypothesis_end == "</s>":
            heapq.heappush(new_beam, (score, hypothesis))
            if len(new_beam) > beam_width:
                heapq.heappop(new_beam)  # drop the worst hypothesis
            continue
        # Expand an unfinished hypothesis. set() avoids expanding the same
        # continuation twice when it appears repeatedly in the tail list, and
        # .get() lets words with no outgoing edges simply die out.
        for possible_continuation in set(graph_nodes.get(hypothesis_end, [])):
            # get_prob(head, tail) is the edge weight, i.e. your ExtractedGraph.getProb.
            continuation_score = score * get_prob(hypothesis_end, possible_continuation)
            heapq.heappush(
                new_beam, (continuation_score, hypothesis + [possible_continuation])
            )
            if len(new_beam) > beam_width:
                heapq.heappop(new_beam)  # keep only the beam_width best
    beam = new_beam
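To match the signature in the question, the same loop can be wrapped into a function that seeds the beam with the prefix instead of a bare <s>. This is only a sketch under a few assumptions: prefix_words is a whitespace-separated string, graph is an ExtractedGraph whose dictionary is reachable as graph.graph_nodes, and scoring still uses raw probability products.

import heapq

def beam_search_sentence(graph, prefix_words, beam_width, sen_length):
    graph_nodes = graph.graph_nodes  # assumed attribute holding the dictionary
    # Seed the beam with <s> followed by the given prefix, with probability 1.0.
    beam = [(1.0, ["<s>"] + prefix_words.split())]
    # Expand until hypotheses reach sen_length words (the prefix counts).
    for _ in range(max(0, sen_length - len(beam[0][1]))):
        new_beam = []
        for score, hypothesis in beam:
            last = hypothesis[-1]
            if last == "</s>":
                heapq.heappush(new_beam, (score, hypothesis))
            else:
                for nxt in set(graph_nodes.get(last, [])):
                    heapq.heappush(
                        new_beam, (score * graph.getProb(last, nxt), hypothesis + [nxt])
                    )
            while len(new_beam) > beam_width:
                heapq.heappop(new_beam)  # prune down to the beam width
        if not new_beam:  # every hypothesis hit a dead-end word
            break
        beam = new_beam
    # Prefer finished sentences; fall back to the best unfinished hypothesis.
    candidates = [c for c in beam if c[1][-1] == "</s>"] or beam
    best_score, best_hypothesis = max(candidates)
    return " ".join(best_hypothesis), best_score

For example, beam_search_sentence(graph, "the yellow", 5, 12) would return the best-scoring hypothesis of at most 12 words that starts with "<s> the yellow", together with its score.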
If your hypotheses can have different lengths, you should consider some length normalization (e.g., a geometric mean of the probabilities). Also, multiplying probabilities might not always be numerically stable, so you might want to use sums of logarithms instead.
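As a small illustration (a sketch, not tied to the class above): scoring with the mean of log-probabilities is the same as ranking by the geometric mean, and it avoids the underflow that long products of small probabilities can cause.

import math

def hypothesis_score(edge_probs):
    # Sum of log-probabilities divided by the number of edges: this is the log
    # of the geometric mean, so hypotheses of different lengths compare fairly
    # and long products no longer underflow towards 0.0.
    if not edge_probs:
        return 0.0
    return sum(math.log(p) for p in edge_probs) / len(edge_probs)

# Example: a hypothesis whose three edges have probabilities 2/3, 1/2 and 1/4
# scores log((2/3 * 1/2 * 1/4) ** (1/3)) ≈ -0.83.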