I have followed the paper here and the code here (it is implemented using the symmetric KLD and the back-off model proposed in the paper in the first link) to compute the KLD between two text data sets. I have changed the for-loop at the end to return the probability distributions of the two data sets, so I can test whether both sum to 1:
import re, math, collections

def tokenize(_str):
    stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is', \
                 'are', 'will', 'in', 'it', 'to', 'that']
    tokens = collections.defaultdict(lambda: 0.)

    for m in re.finditer(r"(\w+)", _str, re.UNICODE):
        m = m.group(1).lower()
        if len(m) < 2: continue
        if m in stopwords: continue
        tokens[m] += 1

    return tokens
#end of tokenize

def kldiv(_s, _t):
    if (len(_s) == 0):
        return 1e33

    if (len(_t) == 0):
        return 1e33

    ssum = 0. + sum(_s.values())
    slen = len(_s)

    tsum = 0. + sum(_t.values())
    tlen = len(_t)

    vocabdiff = set(_s.keys()).difference(set(_t.keys()))
    lenvocabdiff = len(vocabdiff)

    """ epsilon """
    epsilon = min(min(_s.values())/ssum, min(_t.values())/tsum) * 0.001

    """ gamma """
    gamma = 1 - lenvocabdiff * epsilon

    """ Check if distribution probabilities sum to 1"""
    sc = sum([v/ssum for v in _s.itervalues()])
    st = sum([v/tsum for v in _t.itervalues()])

    ps = []
    pt = []
    for t, v in _s.iteritems():
        pts = v / ssum
        ptt = epsilon
        if t in _t:
            ptt = gamma * (_t[t] / tsum)
        ps.append(pts)
        pt.append(ptt)

    return ps, pt
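(For context: before my change, the end of kldiv() did not return the two lists; as far as I remember it accumulated the smoothed divergence, roughly along these lines. This is my paraphrase, not the exact code from the second link.)

# Rough paraphrase of the original ending of kldiv() (a sketch, not the exact
# code from the second link): accumulate KL(_s || smoothed _t) instead of
# returning the two probability lists.
div = 0.
for t, v in _s.iteritems():
    pts = v / ssum                    # P(t | _s)
    ptt = epsilon                     # back-off mass for words absent from _t
    if t in _t:
        ptt = gamma * (_t[t] / tsum)  # discounted P(t | _t)
    div += pts * math.log(pts / ptt)
return div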
I have tested it with:
d1 = """Many research publications want you to use BibTeX, which better
organizes the whole process. Suppose for concreteness your source
file is x.tex. Basically, you create a file x.bib containing the
bibliography, and run bibtex on that file."""
d2 = """In this case you must supply both a \left and a \right because the
delimiter height are made to match whatever is contained between the
two commands. But, the \left doesn't have to be an actual 'left
delimiter', that is you can use '\left)' if there were some reason
to do it."""
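checking the two sums with something like:

ps, pt = kldiv(tokenize(d1), tokenize(d2))   # roughly how I invoke it
print sum(ps), sum(pt)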
sum(ps) comes out as 1, but sum(pt) is much smaller than 1 for these two documents.
Is there something that is not correct in the code, or somewhere else? Thanks!
Update:
In order to make both pt and ps sum to 1, I had to change the code to:
vocab = collections.Counter(_s) + collections.Counter(_t)  # union of both vocabularies

ps = []
pt = []
for t, v in vocab.iteritems():
    if t in _s:
        pts = gamma * (_s[t] / ssum)
    else:
        pts = epsilon
    if t in _t:
        ptt = gamma * (_t[t] / tsum)
    else:
        ptt = epsilon
    ps.append(pts)
    pt.append(ptt)

return ps, pt
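With this change, re-running the same check as above shows both sums at (approximately) 1:

ps, pt = kldiv(tokenize(d1), tokenize(d2))
print sum(ps), sum(pt)   # both now come out at (approximately) 1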
Both sum(ps) and sum(pt) are the total probability mass of _s and _t over the support of s (by "support of s" I mean all words that appear in _s, regardless of the words that appear in _t). This means that sum(ps) is always 1, because the loop runs over every word of _s, while sum(pt) only counts the probability that _t assigns to words that also occur in _s (plus an epsilon for each word of _s missing from _t), so it can be much smaller than 1 whenever _t spends most of its mass on words that never appear in _s.
So, I don't think there's a problem with the code.
Also, contrary to the title of the question, kldiv() does not compute the symmetric KL-divergence, but the KL-divergence between _s and a smoothed version of _t.
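If you actually want the symmetric variant, one way to get it (a sketch on my part, not something taken from the paper or the linked code) is to compute the smoothed divergence in both directions over the joint vocabulary and add the two terms, reusing the ps/pt lists that the updated loop already builds:

def symmetric_kldiv(_s, _t):
    # Sketch only (my own helper, not from the paper or the linked code):
    # assumes kldiv() has been changed as in the update above, so that it
    # returns the two smoothed distributions ps and pt aligned index by index
    # over the joint vocabulary.
    ps, pt = kldiv(_s, _t)
    kl_st = sum(p * math.log(p / q) for p, q in zip(ps, pt))   # KL(P || Q)
    kl_ts = sum(q * math.log(q / p) for p, q in zip(ps, pt))   # KL(Q || P)
    return kl_st + kl_ts

Some papers take the average of the two terms rather than their sum; either variant is symmetric in _s and _t.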