pandas numpy tensorflow keras auto-keras

NumPy array value error from training in Auto-Keras with StratifiedKFold


Background

My sentiment analysis research involves a variety of datasets. Recently I encountered one that I cannot train on successfully. I mostly work with open data in CSV format, so I rely heavily on Pandas and NumPy.

One approach in my research is automated machine learning (AutoML). The library I chose is Auto-Keras, mainly through its TextClassifier() class.

Main Problem

I've confirmed in the official documentation that TextClassifier() accepts data as NumPy arrays. However, when I load the data into a Pandas DataFrame and call .to_numpy() on the columns I need for training, the following error keeps appearing:


ValueError                                Traceback (most recent call last)
<ipython-input-13-1444bf2a605c> in <module>()
     16     clf = ak.TextClassifier(overwrite=True, max_trials=2)
     17 
---> 18     clf.fit(x_train, y_train, epochs=3, callbacks=cbs)
     19 
     20 

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).

Error-related code sections

This is the section where I drop the unneeded DataFrame columns with .drop() and convert the columns I need to NumPy arrays with Pandas' to_numpy() method.


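import pandas as pd

# get_data is defined earlier in the notebook and points at the source CSV.
df_src = pd.read_csv(get_data)

# Keep only the review text and its overall sentiment label.
df_src = df_src.drop(columns=["Name", "Cast", "Plot", "Direction",
                "Soundtrack", "Acting", "Cinematography"])

df_src = df_src.reset_index(drop=True)

X = df_src["Review"].to_numpy()

Y = df_src["Overall Sentiment"].to_numpy()

print(X, "\n")
print("\n", Y)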

This is the part that raises the error, where I split the data with StratifiedKFold() and use TextClassifier() to train and test the model in each fold.


import autokeras as ak
import numpy as np
import tensorflow as tf
from sklearn import metrics

# skf is the StratifiedKFold instance created earlier in the notebook.
fold = 0
for train, test in skf.split(X, Y):
    fold += 1
    print(f"Fold #{fold}\n")

    # Select this fold's training and test rows by index.
    x_train = X[train]
    y_train = Y[train]

    x_test = X[test]
    y_test = Y[test]

    cbs = [tf.keras.callbacks.EarlyStopping(patience=3)]

    clf = ak.TextClassifier(overwrite=True, max_trials=2)

    # This is the line that raises the error.
    clf.fit(x_train, y_train, epochs=3, callbacks=cbs)

    pred = clf.predict(x_test)  # predictions come back as lists of strings

    ceval = clf.evaluate(x_test, y_test)

    # Cast the string predictions to int before comparing with the labels.
    metrics_test = metrics.classification_report(y_test, np.array(list(pred), dtype=int))

    print(metrics_test, "\n")

    print(f"Fold #{fold} finished\n")

Supplementary

I am sharing the full code related to the error on Google Colab, where you can help me diagnose it.

Edit notes

I have tried potential solutions such as:

x_train = np.asarray(x_train).astype(np.float32)
y_train = np.asarray(y_train).astype(np.float32)

or

x_train = tf.data.Dataset.from_tensor_slices((x_train,))
y_train = tf.data.Dataset.from_tensor_slices((y_train,))

However, the problem remains.


Solution

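  • One of the entries in the Review column is NaN rather than a string. Pandas represents a missing value as a float, which is why the error mentions "Unsupported object type float". Remove that entry and its corresponding label before training.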
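
A minimal sketch of that fix, assuming the missing value sits in the "Review" column as described in the question. The small DataFrame below is made-up stand-in data, not the asker's CSV; only the two column names are taken from the question. It first counts the non-string entries, then drops the affected rows before converting to NumPy, so each review stays aligned with its label.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's CSV. The column names match the question;
# the rows are invented for illustration only.
df_src = pd.DataFrame({
    "Review": ["Great film", np.nan, "Terrible pacing", "Loved the cast"],
    "Overall Sentiment": [1, 0, 0, 1],
})

# Diagnose: flag entries in the text column that are not Python strings.
not_str = ~df_src["Review"].map(lambda v: isinstance(v, str))
print("non-string reviews:", int(not_str.sum()))

# Fix: drop whole rows with a missing review, which removes the label too.
df_clean = df_src.dropna(subset=["Review"]).reset_index(drop=True)

X = df_clean["Review"].to_numpy()
Y = df_clean["Overall Sentiment"].to_numpy()
```

Dropping the row on the DataFrame, rather than filtering X after the conversion, keeps X and Y the same length without any extra index bookkeeping. The cleaned arrays can then be passed to skf.split() as in the question.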