Tags: python, python-3.x, tensorflow, xgboost, bert-language-model

How can I train an XGBoost with a generator?


I'm attempting to stack a BERT TensorFlow model with an XGBoost model in Python. To do this, I have trained the BERT model and have a generator that takes the predictions from BERT (which predicts a category) and yields a list: the categorical data concatenated with the BERT prediction. This doesn't train, however, because the generator doesn't have a shape. The code I have is:

...
categorical_inputs=df[cat_cols]
y=pd.get_dummies(df[target_col]).values
xgboost_labels=df[target_col].values
concatenated_text_input=df['concatenated_text']
text_model.fit(tf.constant(concatenated_text_input),tf.constant(y), epochs=8)
cat_text_generator=(list(categorical_inputs.iloc[i].values)+list(text_model.predict([concatenated_text_input.iloc[i]])[0]) for i in range(len(categorical_inputs)))


clf = xgb.XGBClassifier(max_depth=200, n_estimators=400, subsample=1, learning_rate=0.07, reg_lambda=0.1, reg_alpha=0.1,\
                       gamma=1)
clf.fit(cat_text_generator, xgboost_labels)

and the error I get is:

...
-> 1153         if len(X.shape) != 2:
   1154             # Simply raise an error here since there might be many
   1155             # different ways of reshaping

AttributeError: 'generator' object has no attribute 'shape'

Although it's possible to create a list or array to hold the data, I would prefer a solution that would work for when there's too much data to hold in memory at once. Is there a way to use generators to train an xgboost model?


Solution

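  • import xgboost as xgb

    def generator(X_data, y_data, batch_size):
        # Yield consecutive (X, y) batches forever, restarting after each full pass.
        while True:
            for step in range(X_data.shape[0] // batch_size):
                start = step * batch_size
                end = (step + 1) * batch_size
                current_x = X_data.iloc[start:end]
                current_y = y_data.iloc[start:end]
                # Or, if it's a NumPy array, just slice the rows: X_data[start:end]
                yield current_x, current_y

    batch_size = 32
    Generator = generator(X, y, batch_size)
    number_of_steps = X.shape[0] // batch_size

    clf = xgb.XGBClassifier(max_depth=200, n_estimators=400, subsample=1, learning_rate=0.07,
                            reg_lambda=0.1, reg_alpha=0.1, gamma=1)

    booster = None
    for step in range(number_of_steps):
        X_g, y_g = next(Generator)
        # Passing the previous booster continues training; a plain clf.fit(X_g, y_g)
        # would refit from scratch on every batch. Each call adds n_estimators more
        # trees, and every batch is assumed to contain all classes.
        clf.fit(X_g, y_g, xgb_model=booster)
        booster = clf.get_booster()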