Tags: python, tensorflow, keras

Unable to encode categorical variables for deep learning classification model


I am trying to train a convolutional neural network to predict labels (categorical data) from criteria (free text). This should be a simple classification problem: there are 7 labels, so my network has 7 output neurons with sigmoid activation functions.

I encoded training data using the following simple format, in a txt file, using text descriptors ('criteria') and categorical label variables ('label'):

'criteria'|'label'

Here's a peek at one entry from the data file:

Headache location: Bilateral (intracranial). Facial pain: Nil. Pain quality: Pulsating. Thunderclap onset: Nil. Pain duration: 11. Pain episodes per month: 26. Chronic pain: No. Remission between episodes: Yes. Remission duration: 25. Pain intensity: Moderate (4-7). Aggravating/triggering factors: Innocuous facial stimuli, Bathing and/or showering, Chocolate, Exertion, Cold stimulus, Emotion, Valsalva maneuvers. Relieving factors: Nil. Headaches worse in the mornings and/or night: Nil. Associated symptoms: Nausea and/or vomiting. Reversible symptoms: Nil. Examination findings: Nil. Aura present: Yes. Reversible aura: Motor, Sensory, Brainstem, Visual. Duration of auras: 47. Aura in relation to headache: Aura proceeds headache. History of CNS disorders: Multiple Sclerosis, Angle-closure glaucoma. Past history: Nil. Temporal association: No. Disease worsening headache: Nil. Improved cause: Nil. Pain ipsilateral: Nil. Medication overuse: Nil. Establish drug overuse: Nil. Investigations: Nil.|Migraine with aura

Here's a snippet of the code from the training algorithm:

'''A. IMPORT DATA'''
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

dataset = pd.read_csv('Data/ICHD3_Database.txt', names=['criteria', 'label'], sep='|') 
features = dataset['criteria'].values 
labels = dataset['label'].values 
labels = labels.reshape(len(labels), 1) # Reshape target to be a 2d array

'''B. DATA PRE-PROCESSING: WORD EMBEDDINGS'''
def word_embeddings(features):
    maxlen = 200
    features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.33, random_state=42) 
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(features_train)
    features_train = pad_sequences(tokenizer.texts_to_sequences(features_train), padding='post', maxlen=maxlen)
    features_test = pad_sequences(tokenizer.texts_to_sequences(features_test), padding='post', maxlen=maxlen) 
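    # NB: the next two lines run the labels through the text tokenizer; the traceback below points here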
    labels_train = pad_sequences(tokenizer.texts_to_sequences(labels_train), padding='post', maxlen=maxlen)
    labels_test = pad_sequences(tokenizer.texts_to_sequences(labels_test), padding='post', maxlen=maxlen)
    vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
    return features_train, features_test, labels_train, labels_test, vocab_size, maxlen

features_train, features_test, labels_train, labels_test, vocab_size, maxlen = word_embeddings(features) # Pre-process text using word embeddings

'''C. CREATE THE MODEL'''
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

def design_model(features, hidden_layers=2, number_neurons=128):
    model = Sequential(name="My_Sequential_Model")
    model.add(layers.Embedding(input_dim=vocab_size, output_dim=50, input_length=maxlen)) 
    model.add(layers.Conv1D(128, 5, activation='relu'))
    model.add(layers.GlobalMaxPool1D()) 
    for i in range(hidden_layers): 
        model.add(Dense(number_neurons, activation='relu')) 
        model.add(Dropout(0.2)) 
    model.add(Dense(7, activation='sigmoid')) 
    opt = Adam(learning_rate=0.01) 
    model.compile(loss='binary_crossentropy', metrics=['mae'], optimizer=opt) 
    return model

I then pipe the model through a GridSearchCV to find the optimal number of epochs, batch size, etc.
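For reference, the grid search is wired up roughly like this (a sketch assuming the scikeras wrapper; the grid values are illustrative):

from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV

# Wrap the model-building function so scikit-learn can drive training
estimator = KerasClassifier(model=lambda: design_model(features_train), verbose=0)
param_grid = {'batch_size': [16, 32, 64], 'epochs': [10, 20, 30]}
grid = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=3)
grid_result = grid.fit(features_train, labels_train)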

However, before it even reaches the GridSearchCV, running the script produces the following error:

Traceback (most recent call last):
  File "c:\Users\user\Desktop\Deep Learning\deep_learning_headache.py", line 51, in <module>
    features_train, features_test, labels_train, labels_test, vocab_size, maxlen = word_embeddings(features) # Pre-process text using word embeddings
  File "c:\Users\user\Desktop\Deep Learning\deep_learning_headache.py", line 45, in word_embeddings
    labels_train = pad_sequences(tokenizer.texts_to_sequences(labels_train), padding='post', maxlen=maxlen)
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\keras\src\preprocessing\text.py", line 357, in texts_to_sequences
    return list(self.texts_to_sequences_generator(texts))
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\keras\src\preprocessing\text.py", line 386, in texts_to_sequences_generator
    seq = text_to_word_sequence(
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\keras\src\preprocessing\text.py", line 74, in text_to_word_sequence
    input_text = input_text.lower()
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Where am I going wrong?


Solution

  • Based on the exception, it looks like the tokenizer is expecting a string, not a numpy ndarray:

    AttributeError: 'numpy.ndarray' object has no attribute 'lower'

    You can use the call stack to find the line of your own code where the wrong type of thing is being passed in:

    File "c:\Users\user\Desktop\Deep Learning\deep_learning_headache.py", line 45, in word_embeddings labels_train = pad_sequences(tokenizer.texts_to_sequences(labels_train), padding='post', maxlen=maxlen)

    I'd take a look at the tokenizer.texts_to_sequences documentation, examine what type of data is actually in labels_train, and check whether you're passing in the shape/type of thing it expects. In this case, labels was reshaped into a 2-D array before the split, so iterating over labels_train yields single-element ndarrays rather than strings, which is why .lower() fails. The labels are class names, not text, so they shouldn't go through the text tokenizer at all.
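    A minimal sketch of one way to encode the labels instead, in place of the two texts_to_sequences/pad_sequences calls on the labels (this assumes scikit-learn's LabelEncoder and keras.utils.to_categorical; the variable names follow the question's code):

    from sklearn.preprocessing import LabelEncoder
    from tensorflow.keras.utils import to_categorical

    # Flatten the (n, 1) label array back to 1-D strings
    labels_train_flat = labels_train.ravel()
    labels_test_flat = labels_test.ravel()

    # Map the 7 class names to integers 0-6, then to one-hot rows of length 7
    encoder = LabelEncoder()
    labels_train_enc = to_categorical(encoder.fit_transform(labels_train_flat), num_classes=7)
    labels_test_enc = to_categorical(encoder.transform(labels_test_flat), num_classes=7)

    With labels one-hot encoded like this, the usual pairing for a single-label, 7-class output is a softmax activation with categorical_crossentropy, rather than sigmoid with binary_crossentropy.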