Search code examples
pythonnumpypython-c-apihashlib

Hashing numpy object array, how does hashlib see objects' contents and not just pointers?


I am wondering, how come hashing e.g. strings in an np object[] produces expected results:

>>> hashlib.sha256(np.array(['asdfda'], dtype=object)).hexdigest()
'6cc08fd2542235fe8097c017c20b85350899c81616db8cb59045022663e3cee1'
>>> hashlib.sha256(np.array(['asd'+'fda'], dtype=object)).hexdigest()
'6cc08fd2542235fe8097c017c20b85350899c81616db8cb59045022663e3cee1'

That is, the hashing takes into account the actual object value, not a just the pointer value, as stored in the array. (Those strings would definitely have different pointers.)

hashlib methods seem to accepting objects supporting some 'buffer API', as not doing so produces TypeError: object supporting the buffer API required.

Does that mean that buffer API implementation for numpy's ndarray does not return an array of pointers, but rather somehow an array of strings, or in other words how does hashlib.hash_algorithm get to those stored strings of characters?


Solution

  • Those strings would definitely have different pointers.

    Definitely is a pretty strong claim here. Look what I see just testing that out in a REPL:

    >>> s = 'asdfda'
    >>> s2 = 'asd'+'fda'
    >>> s is s2
    True
    

    However,

    >>> s3 = s[:2] + s[2:]
    >>> s is s3
    False
    >>>
    

    And just as expected, the hash is different:

    >>> hashlib.sha256(np.array([s],dtype=object)).hexdigest()
    '176c63097ace4b6754acdd8e37b861bbe1e33489f52d6bd8df07983ead23c73e'
    >>> hashlib.sha256(np.array([s3],dtype=object)).hexdigest()
    '478307a1bfb4bf413c7e538cc4bbe02370072b0968a91155a4a838e68477f62e'
    >>>