I have the same setup and code for running simhash on a Mac, where it works.
But when I run it on Ubuntu, it complains that the implementation of simhash itself has a bug.
Have you encountered such a problem?
objs = [(str(k), Simhash(v)) for k, v in index_data.items()]
File "/usr/local/lib/python2.7/dist-packages/simhash-1.1.2-py2.7.egg/simhash/__init__.py", line 30, in __init__
    self.build_by_text(unicode(value))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 34: ordinal not in range(128)
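The error tells you that a byte string can't be decoded as ASCII. The traceback points at unicode(value) inside Simhash, so the value passed to Simhash is the most likely culprit. Since I don't know where the data comes from or which encoding it actually uses, I can only say that something like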
str(k).decode('cp850')
or
Simhash(v.decode('cp850'))
might help, assuming the string is encoded in cp850. At least '\xf6'.decode('cp850') works without an error.
And since the error is raised inside the module, check that the string you pass in is properly decoded beforehand.
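Decoding beforehand could look roughly like the sketch below. It assumes the values are byte strings in either UTF-8 or cp850; both encodings are guesses, and the helper name to_text and the sample index_data are made up for illustration.

```python
# Hypothetical helper: the name, the fallback order, and the sample data are assumptions.
def to_text(value, encodings=("utf-8", "cp850")):
    """Decode a byte string with the first encoding that succeeds.

    Values that are not byte strings (already decoded text) are returned unchanged.
    """
    if not isinstance(value, bytes):  # on Python 2, bytes is an alias for str
        return value
    for encoding in encodings:
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode %r" % (encodings, value))


# Made-up stand-in for index_data from the question.
index_data = {1: b"plain ascii", 2: b"byte \xf6 that is not ascii"}

# Decode every value up front so nothing has to fall back to the ascii codec.
decoded = dict((str(k), to_text(v)) for k, v in index_data.items())
# The decoded values could then be passed on, e.g. Simhash(decoded["2"]).
```

Because cp850 assigns a character to every byte value, the fallback will not raise, so a wrong guess shows up as wrong characters rather than as an error. If you know the real encoding of the data, pass only that one.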