I'm trying to work through this notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1526931011080774/3624187670661048/6320440561800420/latest.html
I'm using Spark 2.4.3 and XGBoost 0.90, and I keep getting ValueError: bad input shape () when trying to execute the following:
features = inputTrainingDF.select("features").collect()
labels = inputTrainingDF.select("label").collect()
X = np.asarray(map(lambda v: v[0].toArray(), features))
Y = np.asarray(map(lambda v: v[0], labels))
xgbClassifier = xgb.XGBClassifier(max_depth=3, seed=18238, objective='binary:logistic')
model = xgbClassifier.fit(X, Y)
ValueError: bad input shape ()
and also when executing this:
def trainXGbModel(partitionKey, labelAndFeatures):
    X = np.asarray(map(lambda v: v[1].toArray(), labelAndFeatures))
    Y = np.asarray(map(lambda v: v[0], labelAndFeatures))
    xgbClassifier = xgb.XGBClassifier(max_depth=3, seed=18238, objective='binary:logistic')
    model = xgbClassifier.fit(X, Y)
    return [partitionKey, model]
xgbModels = inputTrainingDF\
    .select("education", "label", "features")\
    .rdd\
    .map(lambda row: [row[0], [row[1], row[2]]])\
    .groupByKey()\
    .map(lambda v: trainXGbModel(v[0], list(v[1])))
xgbModels.take(1)
ValueError: bad input shape ()
You can see in the notebook that it works for whoever posted it. My guess is that the problem is in the X and Y np.asarray() mapping, because the logic just maps the label and features into the function, yet the resulting shapes come out empty. I got it working using this code:
pandasDF = inputTrainingDF.toPandas()
series = pandasDF['features'].apply(lambda x : np.array(x.toArray())).as_matrix().reshape(-1,1)
features = np.apply_along_axis(lambda x : x[0], 1, series)
target = pandasDF['label'].values
xgbClassifier = xgb.XGBClassifier(max_depth=3, seed=18238, objective='binary:logistic' )
model = xgbClassifier.fit(features, target)
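(As an aside, DataFrame.as_matrix() is deprecated in recent pandas and has since been removed; a rough equivalent of the feature-extraction lines above using np.stack, assuming the same inputTrainingDF, would be:)

import numpy as np

# Hypothetical replacement for the as_matrix()/apply_along_axis pair:
# stack the dense vectors from each row into a 2-D float array directly.
pandasDF = inputTrainingDF.toPandas()
features = np.stack(pandasDF['features'].apply(lambda v: np.array(v.toArray())).values)
target = pandasDF['label'].values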
However, I want to integrate this into the original function call and understand why the original notebook does not work. An extra set of eyes to troubleshoot this would be much appreciated!
You are probably using Python 3. The issue is that in Python 3, map returns an iterator object rather than a collection. All you have to do to fix this example is change map(...) to list(map(...)):
def trainXGbModel(partitionKey, labelAndFeatures):
    X = np.asarray(list(map(lambda v: v[1].toArray(), labelAndFeatures)))
    Y = np.asarray(list(map(lambda v: v[0], labelAndFeatures)))
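To see why the original version fails, here is a small standalone check (toy data, not from the notebook): in Python 3, np.asarray applied directly to a map object wraps the iterator in a 0-dimensional object array, which is exactly the shape () that the "bad input shape ()" error refers to.

import numpy as np

rows = [(0.0, [1.0, 2.0]), (1.0, [3.0, 4.0])]  # toy (label, features) pairs

# map() is lazy in Python 3, so np.asarray stores the iterator itself
# in a 0-d object array instead of materialising the values.
bad = np.asarray(map(lambda v: v[0], rows))
print(bad.shape, bad.dtype)    # () object

# Wrapping it in list() materialises the values first, giving a proper 1-D array.
good = np.asarray(list(map(lambda v: v[0], rows)))
print(good.shape, good.dtype)  # (2,) float64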
Or you can use np.fromiter to convert the iterator to a NumPy array.
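For example (a sketch on toy data; np.fromiter builds a 1-D array from scalar values, so it fits Y here, while X still needs list(map(...)) or np.stack):

import numpy as np

rows = [(0.0, [1.0, 2.0]), (1.0, [3.0, 4.0])]  # toy (label, features) pairs

# fromiter consumes the iterator directly, without an intermediate list.
Y = np.fromiter(map(lambda v: v[0], rows), dtype=np.float64)
print(Y.shape)  # (2,)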