I'm attempting to stack a BERT tensorflow model with and XGBoost model in python. To do this, I have trained the BERT model and and have a generator that takes the predicitons from BERT (which predicts a category) and yields a list which is the result of categorical data concatenated onto the BERT prediction. This doesn't train, however because it doesn't have a shape. The code I have is:
...
categorical_inputs=df[cat_cols]
y=pd.get_dummies(df[target_col]).values
xgboost_labels=df[target_col].values
concatenated_text_input=df['concatenated_text']
text_model.fit(tf.constant(concatenated_text_input),tf.constant(y), epochs=8)
cat_text_generator=(list(categorical_inputs.iloc[i].values)+list(text_model.predict([concatenated_text_input.iloc[i]])[0]) for i in range(len(categorical_inputs)))
clf = xgb.XGBClassifier(max_depth=200, n_estimators=400, subsample=1, learning_rate=0.07, reg_lambda=0.1, reg_alpha=0.1,\
gamma=1)
clf.fit(cat_text_generator, xgboost_labels)
and the error I get is:
...
-> 1153 if len(X.shape) != 2:
1154 # Simply raise an error here since there might be many
1155 # different ways of reshaping
AttributeError: 'generator' object has no attribute 'shape'
Although it's possible to create a list or array to hold the data, I would prefer a solution that would work for when there's too much data to hold in memory at once. Is there a way to use generators to train an xgboost model?
def generator(X_data,y_data,batch_size):
while True:
for step in range(X_data.shape[0]//batch_size):
start=step*batch_size
end=step*(batch_size+1)
current_x=X_data.iloc[start]
current_y=y_data.iloc[start]
#Or if it's an numpy array just get the rows
yield current_x,current_y
Generator=generator(X,y)
batch_size=32
number_of_steps=X.shape[0]//batch_size
clf = xgb.XGBClassifier(max_depth=200, n_estimators=400, subsample=1, learning_rate=0.07, reg_lambda=0.1, reg_alpha=0.1,\
gamma=1)
for step in number_of_steps:
X_g,y_g=next(Generator)
clf.fit(X_g, y_g)