Tags: r, hadoop, parquet, h2o

Is it possible to write parquet files to local storage from h2o on hadoop?


I'm working with h2o (latest version 3.26.0.10) on a Hadoop cluster. I've read in a parquet file from HDFS and have performed some manipulation on it, built a model, etc.

I've stored some important results in an H2OFrame that I wish to export to local storage, instead of HDFS. Is there a way to export this file as a parquet?

I tried using h2o.exportFile (documentation here: http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.exportFile.html), but all the examples write .csv. I tried a file path with a .parquet extension and that didn't work: it wrote a file, but I think it was really a .csv, since it was exactly the same size as the .csv export.

example: h2o.exportFile(iris_hf, path = "/path/on/h2o/server/filesystem/iris.parquet")

On a related note, if I were to export my H2OFrame to HDFS instead of local storage, would it be possible to write that in parquet format? I could at least then move that to local storage.


Solution

  • h2o added support for exporting parquet files as of version 3.38.0.1.

    Set the format argument to "parquet". Note that h2o.exportFile ignores the parts argument when format is "parquet"; instead, it chooses the number of part files based on the number of chunks in your data.

    https://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.exportFile.html

    h2o.exportFile(
      data = iris_hf,  # replace with your own H2OFrame
      path = "/path/to/exported/parquet/dir",
      format = "parquet"
    )
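
Since the question was asked on 3.26.0.10, which predates the release named above, it may help to check the version before calling the export. The following is only a sketch: `supports_parquet_export` and `export_parquet` are hypothetical helper names, not part of h2o, and the check assumes the installed R package version matches the version of the cluster you are connected to.

```r
# Hypothetical helper: TRUE when the given h2o version string is at least
# 3.38.0.1, the release the answer above names as adding Parquet export.
supports_parquet_export <- function(version) {
  package_version(version) >= package_version("3.38.0.1")
}

# Hypothetical wrapper around h2o.exportFile(). Assumes the h2o package is
# installed and a cluster connection already exists (via h2o.init()).
export_parquet <- function(frame, dir) {
  installed <- as.character(utils::packageVersion("h2o"))
  if (!supports_parquet_export(installed)) {
    stop("h2o ", installed, " is older than 3.38.0.1; upgrade before exporting Parquet")
  }
  # `parts` is deliberately omitted because it is ignored for Parquet output.
  h2o::h2o.exportFile(data = frame, path = dir, format = "parquet")
}

# Usage (not run here):
# export_parquet(iris_hf, "/path/to/exported/parquet/dir")
```

On an older install the wrapper stops with an explicit message rather than writing a file under a misleading .parquet name, which is the situation described in the question.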