python r h2o

Calculate hash of an h2o frame


I would like to calculate a hash value of an h2o.frame.H2OFrame, ideally in both R and Python. My understanding of h2o.frame.H2OFrame is that these objects basically "live" on the h2o server (i.e., they are represented by Java objects) and not within the R or Python session from which they might have been uploaded.

I want to calculate the hash value "as close as possible" to the actual training algorithm. That rules out calculation of the hash value on (serializations of) the underlying R or python objects, as well as on any underlying files from where the data was loaded. The reason for this is that I want to capture all (possible) changes that h2o's upload functions perform on the underlying data.

Inferring from the h2o docs, there is no hash-like functionality exposed through h2o.frame.H2OFrame. One possibility to achieve a hash-like summary of the h2o data is through summing over all numerical columns and doing something similar for categorical columns. However, I would really like to have some avalanche effect in my hash function so that small changes in the function input result in large differences of the output. This requirement rules out simple sums and the like.
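To make the avalanche requirement concrete, here is a minimal sketch in plain Python (no h2o involved). The column values and the helper names `sum_digest`, `sha256_digest` and `differing_bits` are made up for illustration. It only shows why a simple sum is too weak: swapping two values leaves the sum unchanged, while a cryptographic hash over the ordered values changes completely.

```python
import hashlib


def sum_digest(values):
    """Naive summary: the plain sum of a numeric column."""
    return sum(values)


def sha256_digest(values):
    """Order-sensitive summary: SHA-256 over a simple text encoding of the values."""
    h = hashlib.sha256()
    for v in values:
        h.update(repr(float(v)).encode("utf-8"))
        h.update(b"\x1f")  # separator so that adjacent values cannot run together
    return h.hexdigest()


def differing_bits(hex_a, hex_b):
    """Number of bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical column values; b is a with the first two entries swapped.
a = [5.1, 3.5, 1.4, 0.2]
b = [3.5, 5.1, 1.4, 0.2]

print(sum_digest(a) == sum_digest(b))        # the sum does not notice the swap
print(sha256_digest(a) == sha256_digest(b))  # the hash does
print(differing_bits(sha256_digest(a), sha256_digest(b)), "of 256 bits differ")
```

This runs on client-side values, so it is not the server-side hash I am actually after; it is just meant to pin down what I mean by "avalanche effect".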

Is there already some interface which I might have overlooked? If not, how could I achieve the task described above?

import h2o
h2o.init()
iris_df=h2o.upload_file(path="~/iris.csv")

# what I would like to achieve
iris_df.hash()
# >>> ab2132nfqf3rf37 

# ab2132nfqf3rf37 is the (made up) hash value of iris_df

Thank you for your help.


Solution

  • It is available in the REST API (see screenshot). You can probably get to it from the H2OFrame object in Python as well, but it is not directly exposed.
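A rough sketch of reading that value over REST from Python, using only the standard library. Everything specific here is an assumption rather than something confirmed in this thread: that the frame metadata is served at `/3/Frames/<frame_id>`, that the response contains a `frames` list whose first entry carries a `checksum` field, and that the server listens on the default `http://localhost:54321`. The helper names `extract_checksum` and `fetch_frame_checksum` are made up for this example.

```python
import json
import urllib.parse
import urllib.request


def extract_checksum(response):
    """Pull the checksum out of a parsed Frames response.

    Assumes the shape {"frames": [{"checksum": ...}, ...]}; returns None
    if the first frame entry has no "checksum" key.
    """
    frames = response.get("frames") or []
    if not frames:
        raise KeyError("response contains no frames")
    return frames[0].get("checksum")


def fetch_frame_checksum(frame_id, base_url="http://localhost:54321"):
    """Ask a running h2o server for the metadata of one frame.

    Both the endpoint path and the default base_url are assumptions;
    adjust them to match your cluster.
    """
    url = "%s/3/Frames/%s" % (base_url, urllib.parse.quote(frame_id, safe=""))
    with urllib.request.urlopen(url) as resp:
        return extract_checksum(json.load(resp))


# Usage, with an h2o server running and a frame already uploaded:
# print(fetch_frame_checksum(iris_df.frame_id))
```

The frame key should be available in Python as `iris_df.frame_id`. Whether the returned checksum has the avalanche behaviour asked for in the question is not something this thread confirms, so it is worth checking by changing a single cell and comparing the values.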