I'm working with a graph that I extracted from a long article. The directed weighted graph, which is really nothing more than a dictionary, contains head words as vertices, each connected by an edge to every word (tail word) that follows it in the article. So, if the word "yellow" appears 3 times in the article and is followed by the words "brick", "brick", and "submarine", then the "yellow" entry would be represented like so in the graph:
{"yellow": ["brick", "brick", "submarine"]}
This graph was generated using a Python class I wrote called ExtractedGraph that, aside from an __init__ method that does the work of building the graph, has a getProb(self, head_word, tail_word) method that takes the head word and the tail word and returns the probability that the tail word follows the head word, which is the weight of the edge connecting the head word node and the tail word node. So, if we input "yellow" and "brick", the output would be 2/3.
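To make the interface concrete, here is a stripped-down sketch of the class (the real __init__ builds the dictionary from the article text; this version just takes the dictionary as given):

class ExtractedGraph:
    def __init__(self, graph_nodes):
        # graph_nodes maps each head word to the list of words that follow it,
        # repetitions included, e.g. {"yellow": ["brick", "brick", "submarine"]}.
        self.graph_nodes = graph_nodes

    def getProb(self, head_word, tail_word):
        # Edge weight: the relative frequency of tail_word among the followers
        # of head_word, e.g. getProb("yellow", "brick") == 2/3.
        tails = self.graph_nodes.get(head_word, [])
        return tails.count(tail_word) / len(tails) if tails else 0.0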
My question is: how would one do a beam search of this graph to find the sentence with the maximum score? Specifically, what if the input to the beam search function were a prefix_words string, a beam_width int, and a sen_length int (the maximum word length of a sentence)? How would the algorithm look? After reading about the beam search algorithm online and watching numerous tutorials, I'm not sure how the beam search function would really work in this specific scenario.
Let's say graph_nodes is the dictionary, every sentence must start with the <s> symbol with probability 1.0, and all sentences must end with a special symbol </s>. To avoid re-sorting the hypotheses, I keep them in a heap, so keeping only the best beam_width of them is cheap (each push or pop costs O(log beam_width)).
import heapq

beam = [(1.0, ["<s>"])]
for _ in range(sen_length):
    new_beam = []
    for score, hypothesis in beam:
        hypothesis_end = hypothesis[-1]
        # A finished hypothesis returns to the beam and competes with the new ones.
        if hypothesis_end == "</s>":
            heapq.heappush(new_beam, (score, hypothesis))
            if len(new_beam) > beam_width:
                heapq.heappop(new_beam)  # drop the worst hypothesis
            continue
        # Expand an unfinished hypothesis. set() avoids expanding the same
        # continuation twice when it appears repeatedly in the tail list, and
        # .get() lets words with no outgoing edges simply die out.
        for possible_continuation in set(graph_nodes.get(hypothesis_end, [])):
            # get_prob(head, tail) is the edge weight, i.e. your ExtractedGraph.getProb.
            continuation_score = score * get_prob(hypothesis_end, possible_continuation)
            heapq.heappush(
                new_beam, (continuation_score, hypothesis + [possible_continuation])
            )
            if len(new_beam) > beam_width:
                heapq.heappop(new_beam)  # keep only the beam_width best
    beam = new_beam
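To match the signature in the question, the same loop can be wrapped into a function that seeds the beam with the prefix instead of a bare <s>. This is only a sketch under a few assumptions: prefix_words is a whitespace-separated string, graph is an ExtractedGraph whose dictionary is reachable as graph.graph_nodes, and scoring still uses raw probability products.

import heapq

def beam_search_sentence(graph, prefix_words, beam_width, sen_length):
    graph_nodes = graph.graph_nodes  # assumed attribute holding the dictionary
    # Seed the beam with <s> followed by the given prefix, with probability 1.0.
    beam = [(1.0, ["<s>"] + prefix_words.split())]
    # Expand until hypotheses reach sen_length words (the prefix counts).
    for _ in range(max(0, sen_length - len(beam[0][1]))):
        new_beam = []
        for score, hypothesis in beam:
            last = hypothesis[-1]
            if last == "</s>":
                heapq.heappush(new_beam, (score, hypothesis))
            else:
                for nxt in set(graph_nodes.get(last, [])):
                    heapq.heappush(
                        new_beam, (score * graph.getProb(last, nxt), hypothesis + [nxt])
                    )
            while len(new_beam) > beam_width:
                heapq.heappop(new_beam)  # prune down to the beam width
        if not new_beam:  # every hypothesis hit a dead-end word
            break
        beam = new_beam
    # Prefer finished sentences; fall back to the best unfinished hypothesis.
    candidates = [c for c in beam if c[1][-1] == "</s>"] or beam
    best_score, best_hypothesis = max(candidates)
    return " ".join(best_hypothesis), best_score

For example, beam_search_sentence(graph, "the yellow", 5, 12) would return the best-scoring hypothesis of at most 12 words that starts with "<s> the yellow", together with its score.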
If your hypotheses can have different lengths, you should consider some length normalization (e.g., a geometric mean of the probabilities). Also, multiplying probabilities might not always be numerically stable, so you might want to use sums of logarithms instead.
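As a small illustration (a sketch, not tied to the class above): scoring with the mean of log-probabilities is the same as ranking by the geometric mean, and it avoids the underflow that long products of small probabilities can cause.

import math

def hypothesis_score(edge_probs):
    # Sum of log-probabilities divided by the number of edges: this is the log
    # of the geometric mean, so hypotheses of different lengths compare fairly
    # and long products no longer underflow towards 0.0.
    if not edge_probs:
        return 0.0
    return sum(math.log(p) for p in edge_probs) / len(edge_probs)

# Example: a hypothesis whose three edges have probabilities 2/3, 1/2 and 1/4
# scores log((2/3 * 1/2 * 1/4) ** (1/3)) ≈ -0.83.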