Tags: python, machine-learning, dask, xgboost, dask-ml

dask_xgboost.predict works but the result cannot be shown: Data must be 1-dimensional


I am trying to create a model using XGBoost.
Training seems to succeed, but when I run predict on my test data and try to view the actual predictions, I get the following error:

ValueError: Data must be 1-dimensional

This is how I tried to predict my data:

from dask_ml.model_selection import train_test_split
import dask
import xgboost
import dask_xgboost
from dask.distributed import Client
import dask_ml.model_selection as dcv

#split the data
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.33,random_state=42)

client = Client(n_workers=10, threads_per_worker=1)

# hyperparameter tuning
model_xgb = xgb.XGBRegressor(seed=42,verbose=True)


params={
    'learning_rate':[0.1,0.01,0.05],
    'max_depth':[1,5,8],
    'gamma':[0,0.5,1],
    'scale_pos_weight':[1,3,5]
}

grid_search = GridSearchCV(model_xgb, params, cv=3, scoring='neg_mean_squared_error')

grid_search.fit(x_train, y_train)

# train with the best parameters
bst = dask_xgboost.train(client, grid_search.best_params_, x_train, y_train, num_boost_round=10)

#predict data
dask_xgboost.predict(client, bst, x_test).persist()

The last line with predict runs, but when I add .compute() to the end in order to see the actual array, I get the dimension error:

dask_xgboost.predict(client, bst, x_test).persist().compute()
>>>ValueError: Data must be 1-dimensional

How can I get predictions with .predict?


Solution

  • As noted on the PyPI page for dask-xgboost:

    Dask-XGBoost has been deprecated and is no longer maintained.
    The functionality of this project has been included directly
    in XGBoost. To use Dask and XGBoost together, please use
    xgboost.dask instead
    https://xgboost.readthedocs.io/en/latest/tutorials/dask.html.
    

    The code you provided has a few missing assignments and expressions (e.g. how x is defined, where GridSearchCV is imported from). A few things that probably should be changed:

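    import xgboost as xgb
    import xgboost.dask  # makes sure the xgb.dask submodule is loaded

    # note the .dask
    model_xgb = xgb.dask.DaskXGBRegressor(seed=42, verbose=True)

    grid_search = GridSearchCV(model_xgb, params, cv=3, scoring='neg_mean_squared_error')

    grid_search.fit(x_train, y_train)

    # train with the best params
    model_xgb.client = client
    # set_params takes keyword arguments, so the dict has to be unpacked
    model_xgb.set_params(**grid_search.best_params_)
    # lowercase names, matching the train/test split in the question
    model_xgb.fit(x_train, y_train, eval_set=[(x_test, y_test)])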