python nlp viterbi

What is the best data structure for an emission probability table?


For my project I have a dataset of words (e.g. dog, ran, cat), and each word is tagged with a part of speech (e.g. verb, noun, adjective). I need a data structure that stores how many times each word appears as each part of speech. I am currently using a nested list: each inner list starts with a word, followed by one [part of speech, count] pair per tag seen for that word. Here is an example.

emissiontable = [["Fight", ["Verb", 100], ["Noun", 120]], ["Run", ["Verb", 100], ["Noun", 120]]]

This seems tedious, and there is probably a better way to do it, especially since I will have to convert each count to a probability (the probability that a given word is a given part of speech). The result is also called an emission probability table. Is there a better data structure for this?


Solution

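  • I would use dictionaries instead: an outer dictionary keyed by word, whose values are dictionaries mapping each part of speech to its count. Looking up a word or a tag is then a direct key access rather than a scan through a list.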

    https://www.w3schools.com/python/python_dictionaries.asp
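
    As a minimal sketch of that idea (not the only way to do it), the counts could be kept in a dictionary of Counters and then normalized per word, which gives the "probability that a given word is a given part of speech" described in the question. The names tagged_words, counts, and to_probabilities, and the sample data, are made up for illustration.

    ```python
    from collections import Counter, defaultdict

    # Hypothetical sample data standing in for the tagged dataset:
    # one (word, part_of_speech) pair per tagged token.
    tagged_words = [
        ("fight", "Verb"), ("fight", "Noun"), ("fight", "Noun"),
        ("run", "Verb"), ("run", "Verb"), ("run", "Noun"),
    ]

    # counts[word][pos] = number of times `word` was tagged as `pos`.
    # defaultdict(Counter) creates the inner Counter on first access.
    counts = defaultdict(Counter)
    for word, pos in tagged_words:
        counts[word][pos] += 1


    def to_probabilities(counts):
        """Divide each word's counts by that word's total, so each row sums to 1."""
        table = {}
        for word, pos_counts in counts.items():
            total = sum(pos_counts.values())
            table[word] = {pos: n / total for pos, n in pos_counts.items()}
        return table


    probabilities = to_probabilities(counts)
    print(counts["fight"]["Noun"])         # count lookup by word, then tag
    print(probabilities["fight"]["Noun"])  # share of "fight" tokens tagged Noun
    ```

    One caveat: in an HMM tagger decoded with Viterbi, the emission probability is conventionally P(word | tag) rather than P(tag | word). If that is what you need, the same counts work, but each count should be divided by the total for its tag instead of the total for its word.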