Search code examples
pythonpython-3.xdictionaryfrequencymarkov-chains

Calculating total and relative frequency of values in a dict representing a Markov-chain rule


I have made a function make_rule(text, scope=1) that simply goes over a string and generates a dictionary that serves as a rule for a Markovian text-generator (where the scope is the number of linked characters, not words).

>>> rule = make_rule("abbcad", 1)
>>> rule
{'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']}

I have been tasked with calculating the entropy of this system. In order to do that I think I would need to know:

  1. How often a value appears in the dictionary in total, i.e. its total frequency.
  2. How often a value appears given a key in the dictionary, i.e. its relative frequency.

Is there a quick way to get both of these numbers for each of the values in the dictionary?

For the above example I would need this output:

'a' total: 1, 'a'|'a': 0, 'a'|'b': 0, 'a'|'c': 1
'b' total: 2, 'b'|'a': 1, 'b'|'b': 1, 'b'|'c': 0
'c' total: 1, 'c'|'a': 0, 'c'|'b': 1, 'c'|'c': 0
'd' total: 1, 'd'|'a': 1, 'a'|'b': 1, 'a'|'c': 1

I guess the 'a' total is easily inferred, so maybe instead just output a list of triples for every unique item that appears in the dictionary:

[[('a', 'a', 0), ('a', 'b', 0), ('a', 'c', 1)], [('b', 'a', 1), ('b', 'b', 1), ('b', 'c', 0)], ...]

Solution

  • I'll just deal with "How often a value appears given a key in the dictionary", since you've said that "How often a value appears in the dictionary in total" is easily inferred.

    If you just want to be able to look up the relative frequency of a value for a given key, it's easy to get that with a dict of Counter objects:

    from collections import Counter
    
    rule = {'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']}
    
    freq = {k: Counter(v) for k, v in rule.items()}
    

    … which gives you a freq like this:

    {
        'a': Counter({'b': 1, 'd': 1}),
        'b': Counter({'b': 1, 'c': 1}),
        'c': Counter({'a': 1})
    }
    

    … so that you can get the relative frequency of 'a' given the key 'c' like this:

    >>> freq['c']['a']
    1
    

    Because Counter objects return 0 for nonexistent keys, you'll also get zero frequencies as you would expect:

    >>> freq['a']['c']
    0
    

    If you need a list of 3-tuples as specified in your question, you can get that with a little extra work. Here's a function to do it:

    def triples(rule):               
        freq = {k: Counter(v) for k, v in rule.items()}
        all_values = sorted(set().union(*rule.values()))      
        sorted_keys = sorted(rule)
        return [(v, k, freq[k][v]) for v in all_values for k in sorted_keys] 
    

    The only thing here which I think may not be self-explanatory is the all_values = ... line, which:

    1. creates an empty set()
    2. produces the union() of that set with all the individual elements of the lists in rule.values() (note the use of the argument-unpacking * operator)
    3. converts the result into a sorted() list.

    If you still have the original text, you can avoid all that work by using e.g. all_values = sorted(set(original_text)) instead.

    Here it is in action:

    >>> triples({'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']})
    [
        ('a', 'a', 0), ('a', 'b', 0), ('a', 'c', 1),
        ('b', 'a', 1), ('b', 'b', 1), ('b', 'c', 0),
        ('c', 'a', 0), ('c', 'b', 1), ('c', 'c', 0),
        ('d', 'a', 1), ('d', 'b', 0), ('d', 'c', 0)
    ]