I have made a function make_rule(text, scope=1)
that simply goes over a string and generates a dictionary that serves as a rule for a Markovian text-generator (where the scope is the number of linked characters, not words).
>>> rule = make_rule("abbcad", 1)
>>> rule
{'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']}
I have been tasked with calculating the entropy of this system. In order to do that I think I would need to know:
Is there a quick way to get both of these numbers for each of the values in the dictionary?
For the above example I would need this output:
'a' total: 1, 'a'|'a': 0, 'a'|'b': 0, 'a'|'c': 1
'b' total: 2, 'b'|'a': 1, 'b'|'b': 1, 'b'|'c': 0
'c' total: 1, 'c'|'a': 0, 'c'|'b': 1, 'c'|'c': 0
'd' total: 1, 'd'|'a': 1, 'a'|'b': 1, 'a'|'c': 1
I guess the 'a'
total is easily inferred, so maybe instead just output a list of triples for every unique item that appears in the dictionary:
[[('a', 'a', 0), ('a', 'b', 0), ('a', 'c', 1)], [('b', 'a', 1), ('b', 'b', 1), ('b', 'c', 0)], ...]
I'll just deal with "How often a value appears given a key in the dictionary", since you've said that "How often a value appears in the dictionary in total" is easily inferred.
If you just want to be able to look up the relative frequency of a value for a given key, it's easy to get that with a dict
of Counter
objects:
from collections import Counter
rule = {'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']}
freq = {k: Counter(v) for k, v in rule.items()}
… which gives you a freq
like this:
{
'a': Counter({'b': 1, 'd': 1}),
'b': Counter({'b': 1, 'c': 1}),
'c': Counter({'a': 1})
}
… so that you can get the relative frequency of 'a'
given the key 'c'
like this:
>>> freq['c']['a']
1
Because Counter
objects return 0 for nonexistent keys, you'll also get zero frequencies as you would expect:
>>> freq['a']['c']
0
If you need a list of 3-tuples as specified in your question, you can get that with a little extra work. Here's a function to do it:
def triples(rule):
freq = {k: Counter(v) for k, v in rule.items()}
all_values = sorted(set().union(*rule.values()))
sorted_keys = sorted(rule)
return [(v, k, freq[k][v]) for v in all_values for k in sorted_keys]
The only thing here which I think may not be self-explanatory is the all_values = ...
line, which:
set()
union()
of that set with all the individual elements of the lists in rule.values()
(note the use of the argument-unpacking *
operator)sorted()
list.If you still have the original text, you can avoid all that work by using e.g. all_values = sorted(set(original_text))
instead.
Here it is in action:
>>> triples({'a': ['b', 'd'], 'b': ['b', 'c'], 'c': ['a']})
[
('a', 'a', 0), ('a', 'b', 0), ('a', 'c', 1),
('b', 'a', 1), ('b', 'b', 1), ('b', 'c', 0),
('c', 'a', 0), ('c', 'b', 1), ('c', 'c', 0),
('d', 'a', 1), ('d', 'b', 0), ('d', 'c', 0)
]