python pandas numpy pyspark h2o

Python/pyspark: pass H2O dataframe to sklearn kneighbors as array


I have an H2O frame that I need to pass to sklearn's kneighbors (NearestNeighbors). If I'm not wrong, NearestNeighbors (from sklearn.neighbors import NearestNeighbors) accepts only arrays. I tried it with a single row and it works. But how can I pass the whole H2O frame to that function? I guess I could use a for loop, but I'm wondering whether there is a more efficient way. FYI: I'm using pyspark for my implementation.

from sklearn.neighbors import NearestNeighbors

h20_df_mod_output = model_name(input_Dataset)
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(centroid_values['centroids'])
distance, indices = neigh.kneighbors([h20_df_mod_output[1,:]]) # How can I pass the entire dataset here?

Solution

  • I believe the algorithms from Scikit-Learn do not accept H2O Frames. So you can convert the H2O Frame into, for example, a Pandas DataFrame, which kneighbors can take as a whole (one row per query point), by doing:

    pandas_frame = h2o_frame.as_data_frame()
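
To show how the converted frame can replace the for loop, here is a minimal sketch. It assumes the centroids and the model output are purely numeric and share the same columns. The `centroids` array and the small `pandas_frame` built here are made-up stand-ins for `centroid_values['centroids']` and the result of `h2o_frame.as_data_frame()`, so the snippet runs without an H2O cluster.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for centroid_values['centroids']: three 2-D centroids.
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])

# Hypothetical stand-in for h2o_frame.as_data_frame(): one row per observation.
pandas_frame = pd.DataFrame({"x": [0.5, 4.0, 9.0, 6.0],
                             "y": [0.2, 5.5, 1.0, 4.0]})

neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(centroids)

# kneighbors expects a 2-D array of shape (n_queries, n_features),
# so every row is queried in a single call instead of a Python loop.
distance, indices = neigh.kneighbors(pandas_frame.to_numpy())

# With n_neighbors=1 both results have shape (n_rows, 1); flatten for convenience.
nearest_centroid = indices.ravel()
print(nearest_centroid)
```

Note that as_data_frame() pulls the whole frame into the local Python process, so this approach only fits data that is small enough to hold in memory on the driver.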