
hashlib: Unicode-objects must be encoded before hashing


I am running this hashlib code and it runs almost all the way:

def generate_hashes(peaks, fan_value=DEFAULT_FAN_VALUE):
    if PEAK_SORT:
        sorted(peaks, key=itemgetter(1))

    # bruteforce all peaks
    peaks = list(peaks)
    len_peaks = len(peaks)
    for i in range(len_peaks):
        for j in range(1, fan_value):
            if (i + j) < len(peaks):

                # take current & next peak frequency value
                freq1 = peaks[i][IDX_FREQ_I]
                freq2 = peaks[i + j][IDX_FREQ_I]

                # take current & next peak time offset
                t1 = peaks[i][IDX_TIME_J]
                t2 = peaks[i + j][IDX_TIME_J]

                # get diff of time offsets
                t_delta = t2 - t1

                # check if delta is between min & max
                if t_delta >= MIN_HASH_TIME_DELTA and t_delta <= MAX_HASH_TIME_DELTA:
                    h = hashlib.sha1(("%s|%s|%s") % (str(freq1), str(freq2), str(t_delta)))
                    yield (h.hexdigest()[0:FINGERPRINT_REDUCTION], t1)

However, it returns this error:

 h = hashlib.sha1(("%s|%s|%s") % (str(freq1), str(freq2), str(t_delta)))
TypeError: Unicode-objects must be encoded before hashing

I am honestly completely lost and don't know how to fix it. If you have any follow-up questions about the code, I will do my best to answer. Any feedback would be appreciated.


Solution

  • The answer is in the error message: use encode on your text string before hashing.

    h = hashlib.sha1(("%s|%s|%s" % (str(freq1), str(freq2), str(t_delta))).encode('utf-8'))
    

    This is necessary because hashlib.sha1() operates on bytes, not text. Python strings (since version 3.0) are sequences of Unicode code points, and most code points do not fit into a single byte. An encoding defines how code points are translated to bytes and back. UTF-8 is the most popular encoding because it can represent every Unicode code point while remaining backward compatible with ASCII.
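
To make the point above concrete, here is a minimal sketch that can be run on its own. The helper name `fingerprint`, its `length` parameter, and the sample values are assumptions made for illustration; they are not part of the original code.

```python
import hashlib

# ASCII characters encode to one byte in UTF-8; other code points need more.
ascii_bytes = "a".encode("utf-8")      # one byte
accented_bytes = "é".encode("utf-8")   # two bytes


def fingerprint(freq1, freq2, t_delta, length=20):
    """Hypothetical helper: hash three values and return a shortened hex digest."""
    text = "%s|%s|%s" % (freq1, freq2, t_delta)
    # Encode the str to bytes before handing it to hashlib.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:length]


# Passing a str directly is rejected in Python 3.
try:
    hashlib.sha1("10|20|5")
    str_rejected = False
except TypeError:
    str_rejected = True

digest = fingerprint(10, 20, 5)
```

The `%s` format already converts each value with `str()`, so the explicit `str(...)` calls in the original line are harmless but not required.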