python python-3.x dictionary frequency markov-chains

Calculating total and relative frequency of values in a dict representing a Markov-chain rule

I have made a function make_rule(text, scope=1) that simply goes over a string and generates a dictionary that serves as a rule for a Markovian text-generator (where the scope is the number of linked characters, not words).

>>> rule = make_rule("abbcad", 1)
>>> rule
{'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']}

I have been tasked with calculating the entropy of this system. In order to do that I think I would need to know:

How often a value appears in the dictionary in total, i.e. its total frequency.
How often a value appears given a key in the dictionary, i.e. its relative frequency.

Is there a quick way to get both of these numbers for each of the values in the dictionary?

For the above example I would need this output:

'a' total: 1, 'a'|'a': 0, 'a'|'b': 0, 'a'|'c': 1
'b' total: 2, 'b'|'a': 1, 'b'|'b': 1, 'b'|'c': 0
'c' total: 1, 'c'|'a': 0, 'c'|'b': 1, 'c'|'c': 0
'd' total: 1, 'd'|'a': 1, 'a'|'b': 1, 'a'|'c': 1

I guess the 'a' total is easily inferred, so maybe instead just output a list of triples for every unique item that appears in the dictionary:

[[('a', 'a', 0), ('a', 'b', 0), ('a', 'c', 1)], [('b', 'a', 1), ('b', 'b', 1), ('b', 'c', 0)], ...]

Solution

I'll just deal with "How often a value appears given a key in the dictionary", since you've said that "How often a value appears in the dictionary in total" is easily inferred.

If you just want to be able to look up the relative frequency of a value for a given key, it's easy to get that with a dict of Counter objects:

from collections import Counter

rule = {'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']}

freq = {k: Counter(v) for k, v in rule.items()}

… which gives you a freq like this:

{
    'a': Counter({'b': 1, 'd': 1}),
    'b': Counter({'b': 1, 'c': 1}),
    'c': Counter({'a': 1})
}

… so that you can get the relative frequency of 'a' given the key 'c' like this:

>>> freq['c']['a']
1

Because Counter objects return 0 for nonexistent keys, you'll also get zero frequencies as you would expect:

>>> freq['a']['c']
0

If you need a list of 3-tuples as specified in your question, you can get that with a little extra work. Here's a function to do it:

def triples(rule):               
    freq = {k: Counter(v) for k, v in rule.items()}
    all_values = sorted(set().union(*rule.values()))      
    sorted_keys = sorted(rule)
    return [(v, k, freq[k][v]) for v in all_values for k in sorted_keys]

The only thing here which I think may not be self-explanatory is the all_values = ... line, which:

creates an empty set()
produces the union() of that set with all the individual elements of the lists in rule.values() (note the use of the argument-unpacking * operator)
converts the result into a sorted() list.

If you still have the original text, you can avoid all that work by using e.g. all_values = sorted(set(original_text)) instead.

Here it is in action:

>>> triples({'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']})
[
    ('a', 'a', 0), ('a', 'b', 0), ('a', 'c', 1),
    ('b', 'a', 1), ('b', 'b', 1), ('b', 'c', 0),
    ('c', 'a', 0), ('c', 'b', 1), ('c', 'c', 0),
    ('d', 'a', 1), ('d', 'b', 0), ('d', 'c', 0)
]