
How to generate a hash for each row of an RDD? (PySpark)


As specified in the question, I'm trying to generate a hash for each row of an RDD. For my purposes I cannot use the zipWithUniqueId() method; I need one hash of all the columns, for each row of the RDD.

for row in DataFrame.collect():
    return hashlib.sha1(str(row))

I know this is the worst way, iterating over the RDD, but I'm a beginner with PySpark. The problem, however, is that I obtain the same hash for each row. I tried to use a strongly collision-resistant hash function, but it is too slow. Is there some way to solve the problem? Thanks in advance :)


Solution

  • Your hashing method seems to be OK. Are you sure you are using Python in the proper way? If you place the provided code into a function, it will always return the hash of the first row in the DataFrame, because there is a return inside the loop.
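
    For completeness, a minimal local fix could accumulate the digests in a list instead of returning early (df here stands for your DataFrame; note that collect() still pulls every row to the driver, so this does not scale):

        import hashlib

        # Hash every row, not just the first one: build a list instead of
        # returning from inside the loop. encode() is needed because
        # hashlib works on bytes in Python 3.
        hashes = [hashlib.sha1(str(row).encode('utf-8')).hexdigest()
                  for row in df.collect()]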

    You can calculate the hashes in a distributed way by going from the DataFrame to its RDD and performing a map, for example:

    >>> import hashlib
    >>> numbers = spark.range(10)
    >>> numbers.show()
    +---+
    | id|
    +---+
    |  0|
    |  1|
    |  2|
    |  3|
    |  4|
    |  5|
    |  6|
    |  7|
    |  8|
    |  9|
    +---+
    
    >>> numbers.rdd.map(lambda row: hashlib.sha1(str(row).encode('utf-8')).hexdigest()).collect()
    ['ec0dbe879dee5ca3b0d5f80687993273213611c9', 
    'd19469cfdac63a279b2068a989bebb8918af721a', 
    'c5051bbf3ac45c49e29041b9bd840badd484fd94', 
    '7916b1b00f01e1676a3ed7ff80e9614430c74e4d', 
    '3ef92cd5a3abdbf996694ba08685676b26478121', 
    '6e0820c8a947c2d0f53c2d2957e4d256f6e75f25', 
    '2297e8b06e13cc79861aed7c919b5333dfe39049', 
    '1b64fd47d48f2fc7d7d45a4c6e9b1958e973ab8c', 
    '6e53b27c52c20e2fb2ffa5b3a1013c13fad21db7', 
    '02d08951fde664abbbec94b37ab322e751c40e33']
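
    If you want to avoid shipping every row through Python, a sketch of an alternative (assuming Spark 2.x or later) keeps the work in the DataFrame API using the built-in concat_ws and sha2 functions; the column name row_hash and the "||" separator are arbitrary choices here:

        from pyspark.sql.functions import col, concat_ws, sha2

        # Cast each column to string, join them with a separator, and
        # hash the result with SHA-256 inside the JVM.
        # Caveat: concat_ws skips NULLs, so rows that differ only
        # between NULL and an empty string would collide.
        with_hash = numbers.withColumn(
            "row_hash",
            sha2(concat_ws("||",
                           *[col(c).cast("string") for c in numbers.columns]),
                 256),
        )
        with_hash.show(truncate=False)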