I am wondering how hashing e.g. strings stored in an np object[] produces the expected results:
>>> hashlib.sha256(np.array(['asdfda'], dtype=object)).hexdigest()
'6cc08fd2542235fe8097c017c20b85350899c81616db8cb59045022663e3cee1'
>>> hashlib.sha256(np.array(['asd'+'fda'], dtype=object)).hexdigest()
'6cc08fd2542235fe8097c017c20b85350899c81616db8cb59045022663e3cee1'
That is, the hashing appears to take into account the actual object value, not just the pointer value stored in the array. (Those strings would definitely have different pointers.)
hashlib methods seem to accept objects supporting some 'buffer API'; passing anything else produces TypeError: object supporting the buffer API required.
Does that mean that the buffer API implementation for NumPy's ndarray does not return an array of pointers, but rather somehow an array of strings? In other words, how does hashlib.hash_algorithm get to those stored strings of characters?
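To make the "buffer API" point concrete, here is a small sketch reproducing the error mentioned above: bytes objects expose a buffer, while e.g. a plain list does not (the exact exception message may vary slightly across Python versions):

```python
import hashlib

# bytes support the buffer protocol, so hashing them works
print(hashlib.sha256(b"asdfda").hexdigest())

# a plain list does not expose a buffer, so hashlib rejects it
try:
    hashlib.sha256([1, 2, 3])
except TypeError as e:
    print(e)  # e.g. "object supporting the buffer API required"
```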
Those strings would definitely have different pointers.
"Definitely" is a pretty strong claim here. Look what I see just testing that out in a REPL:
>>> s = 'asdfda'
>>> s2 = 'asd'+'fda'
>>> s is s2
True
(CPython's compiler constant-folds 'asd'+'fda' into the literal 'asdfda' at compile time and interns it, so both names end up bound to the very same object.)
However,
>>> s3 = s[:2] + s[2:]
>>> s is s3
False
>>>
And just as expected, the hash is different:
>>> hashlib.sha256(np.array([s],dtype=object)).hexdigest()
'176c63097ace4b6754acdd8e37b861bbe1e33489f52d6bd8df07983ead23c73e'
>>> hashlib.sha256(np.array([s3],dtype=object)).hexdigest()
'478307a1bfb4bf413c7e538cc4bbe02370072b0968a91155a4a838e68477f62e'
>>>
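So the object array's buffer really does hold just the pointer. To see this directly, here is a sketch that compares hashing the array with hashing the object's address; it assumes CPython (where id() is the object's memory address, an implementation detail) and a NumPy version that exposes object arrays through the buffer protocol, as in the snippets above:

```python
import hashlib
import sys
import numpy as np

s = 'x' * 10  # some arbitrary string object
arr = np.array([s], dtype=object)

# On CPython, id(s) is the object's address; the object array's buffer
# holds exactly that pointer value, itemsize bytes in native byte order.
ptr_bytes = id(s).to_bytes(arr.itemsize, sys.byteorder)

# Hashing the array is the same as hashing the raw pointer bytes.
print(hashlib.sha256(arr).hexdigest() == hashlib.sha256(ptr_bytes).hexdigest())
```

This is why two names bound to the same interned string hash identically, while a freshly built equal string (like s3 above) lives at a different address and hashes differently.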