Search code examples
rh2oprivacy

Security concerns with the H2O R package


I am using the H2O R package.

My understanding is, that this package requires you to have an internet connection as well as connect to the the h2o servers? If you use the h2o package run machine learning models on your data, does h2o "see" your data? I turned off my wifi and tried running some machine learning models using h2o :

data(iris) 
library(h2o)
h2o.init() 
iris_hf <- as.h2o(iris) 
iris_dl <- h2o.deeplearning(x = 1:4, y = 5, training_frame = iris_hf, seed=123456) 
predictions <- h2o.predict(iris_dl, iris_hf) 

This seems to work, but could someone please confirm? If you do not want anyone to see your data, is it still a good idea to use the "h2o" library? Since the code above runs without an internet connection, I am not sure about this.


Solution

  • From the documentation of h2o.init() (emphasis mine):

    This method first checks if H2O is connectible. If it cannot connect and startH2O = TRUE with IP of localhost, it will attempt to start an instance of H2O with IP = localhost, port = 54321. Otherwise, it stops immediately with an error. When initializing H2O locally, this method searches for h2o.jar in the R library resources [...], and if the file does not exist, it will automatically attempt to download the correct version from Amazon S3. The user must have Internet access for this process to be successful. Once connected, the method checks to see if the local H2O R package version matches the version of H2O running on the server. If there is a mismatch and the user indicates she wishes to upgrade, it will remove the local H2O R package and download/install the H2O R package from the server.

    So, h2o.init() with the default setting ip = "127.0.0.1", as here, connects the R session with the H2O instance (sometimes referred to as "server") in your local machine. If all the necessary package files are in place and up to date, no internet connection is necessary; the package will attempt to connect to the internet only to download stuff in case something is not present or up to date. No data is uploaded anywhere.