python, pandas, tensorflow, machine-learning, text-classification

My text classifier model doesn't improve with multiple classes


I'm trying to train a model for text classification. The model takes a list of at most 300 integer tokens derived from articles. The model trains without any problem, but the accuracy won't go up.

The target consists of 41 categories, encoded as integers from 0 to 40 and then normalized.

The table would look like this

Table1

Also, I don't know what my model should look like, since I referred to two different examples, as listed below:

  • A binary classifier with one input column and one output column (Example 1)
  • A multi-class classifier with multiple input columns (Example 2)

I have tried modifying my model based on both examples, but the accuracy won't change and even gets lower with each epoch.

Should I add more layers to my model, or have I done something stupid that I haven't realized?

Note: If the 'df.pickle' download link is broken, use this link.

from sklearn.model_selection import train_test_split
from urllib.request import urlopen
from os.path import exists
from os import mkdir
import tensorflow as tf
import pandas as pd
import pickle

# Define dataframe path
df_path = 'df.pickle'

# Check if local dataframe exists
if not exists(df_path):
  # Download binary from dropbox
  content = urlopen('https://ucd92a22d5e0d4d29b8edb608305.dl.dropboxusercontent.com/cd/0/get/Askx_25n3JI-jmnZsWXmMmRgd4O2EH1w9l0U6zCMq7xdSXs_IN_i2zuUviseqa9N7-WrReFbGhQi8CeseV5cNsFTO8dzRmSdxjr-MWEDQNpPaZ8Ik29E_58YAjY57qTc4CA/file#').read()

  # Write to file
  with open(df_path, 'wb') as file: file.write(content)

  # Load the dataframe from bytes
  df = pickle.loads(content)
# If the file exists (aka. downloaded)
else:
  # Load the dataframe from file
  df = pickle.load(open(df_path, 'rb'))

# Normalize the category
df['Category_Code'] = df['Category_Code'].apply(lambda x: x / 41)

train_df, test_df = [pd.DataFrame() for _ in range(2)]

x_train, x_test, y_train, y_test = train_test_split(df['Content_Parsed'], df['Category_Code'], test_size=0.15, random_state=8)
train_df['Content_Parsed'], train_df['Category_Code'] = x_train, y_train
test_df['Content_Parsed'], test_df['Category_Code'] = x_test, y_test

# Variable containing the number of words we want to keep in our vocabulary
NUM_WORDS = 10000
# Input/Token length
SEQ_LEN = 300

# Create tokenizer for our data
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=NUM_WORDS, oov_token='<UNK>')
tokenizer.fit_on_texts(train_df['Content_Parsed'])

# Convert text data to numerical indexes
train_seqs=tokenizer.texts_to_sequences(train_df['Content_Parsed'])
test_seqs=tokenizer.texts_to_sequences(test_df['Content_Parsed'])

# Pad data up to SEQ_LEN (note that we truncate if there are more than SEQ_LEN tokens)
train_seqs=tf.keras.preprocessing.sequence.pad_sequences(train_seqs, maxlen=SEQ_LEN, padding="post")
test_seqs=tf.keras.preprocessing.sequence.pad_sequences(test_seqs, maxlen=SEQ_LEN, padding="post")

# Create Models folder if not exists
if not exists('Models'): mkdir('Models')

# Define local model path
model_path = 'Models/model.pickle'

# Check if model exists/pre-trained
if not exists(model_path):
  # Define word embedding size
  EMBEDDING_SIZE = 16

  # Create new model
  '''
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_SIZE),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(EMBEDDING_SIZE)),
    # tf.keras.layers.Dense(EMBEDDING_SIZE, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])
  '''
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_SIZE),
      # tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(EMBEDDING_SIZE)),
      tf.keras.layers.GlobalAveragePooling1D(),
      tf.keras.layers.Dense(EMBEDDING_SIZE, activation='relu'),
      tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  # Compile the model
  model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
  )

  # Stop training when a monitored quantity has stopped improving.
  es = tf.keras.callbacks.EarlyStopping(monitor='val_acc', mode='max', patience=1)

  # Define batch size (Can be tuned to improve model accuracy)
  BATCH_SIZE = 16
  # Define number of epochs (training cycles)
  EPOCHS = 20

  # Use the GPU (if this raises an error, no GPU is available; use the CPU instead)
  with tf.device('/GPU:0'):
    # Train/Fit the model
    history = model.fit(
      train_seqs, 
      train_df['Category_Code'].values, 
      batch_size=BATCH_SIZE, 
      epochs=EPOCHS, 
      validation_split=0.2,
      validation_steps=30,
      callbacks=[es]
    )

  # Evaluate the model
  model.evaluate(test_seqs, test_df['Category_Code'].values)

  # Save the model into a file
  with open(model_path, 'wb') as file: file.write(pickle.dumps(model))
else:
  # Load the model
  model = pickle.load(open(model_path, 'rb'))

# Check the model
model.summary()

Solution

  • After 2 days of tweaking and studying more examples, I found this website, which explains multi-class classification quite well.

    The details of changes I made are as follows:

    1. Since I'm building a model for multiple classes, the model should be compiled with categorical_crossentropy as its loss function instead of binary_crossentropy.

    2. The model's output layer should have as many units as there are classes to classify, which in my case is 41, with the labels one-hot encoded.

    3. The last layer's activation function should be "softmax", since we're choosing the label with the highest confidence (closest to 1.0). A minimal sketch of these changes is shown right after this list.

    4. You will need to tweak the layers based on the number of classes you're going to classify. See here on how to improve your model.
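
    Putting points 1-3 together, a minimal sketch of the changed parts looks like this (keeping the embedding/pooling layers from my original model just for illustration; the layer sizes are placeholders, not tuned values, and category_codes stands for the integer labels):

    import tensorflow as tf

    NUM_CLASSES = 41  # one output unit per category

    # Same embedding/pooling idea as the original model; only the "head" changes
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(10000, 16),
        tf.keras.layers.GlobalAveragePooling1D(),
        # One unit per class, softmax to pick the most confident label
        tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
    ])

    # categorical_crossentropy expects one-hot labels, e.g.
    # y = tf.keras.utils.to_categorical(category_codes, NUM_CLASSES)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])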

    My final code looks something like this:

    from sklearn.model_selection import train_test_split
    from urllib.request import urlopen
    from os.path import exists
    import tensorflow as tf
    import pandas as pd
    import pickle
    
    # Specify dataframe path
    df_path = 'df.pickle'
    # Check if the file exists
    if not exists(df_path):
      # Specify url of the dataframe binary
      url = 'https://www.dropbox.com/s/76hibe24hmpz3bk/df.pickle?dl=1'
      # Read the byte content from url
      content = urlopen(url).read()
      # Write to a file so later runs can skip the download
      with open(df_path, 'wb') as file: file.write(content)
      # Unpickle the dataframe
      df = pickle.loads(content)
    else:
      # Load the pickle dataframe
      df = pickle.load(open(df_path, 'rb'))
    
    # Useful variables
    MAX_NUM_WORDS = 50000                        # Vocabulary size for our tokenizer
    MAX_SEQ_LENGTH = 600                         # Maximum length of tokens (for padding later)
    EMBEDDING_SIZE = 256                         # Embedding size (Tweak to improve accuracy)
    OUTPUT_LENGTH = len(df['Category'].unique()) # Number of classes to be classified
    
    # Create our tokenizer
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=MAX_NUM_WORDS, lower=True)
    # Fit our tokenizer with words/tokens
    tokenizer.fit_on_texts(df['Content_Parsed'].values)
    # Get our token vocabulary
    word_index = tokenizer.word_index
    print('Found {} unique tokens'.format(len(word_index)))
    
    # Parse our text into sequence of numbers using our tokenizer
    X = tokenizer.texts_to_sequences(df['Content_Parsed'].values)
    # Pad the sequence up to the MAX_SEQ_LENGTH
    X = tf.keras.preprocessing.sequence.pad_sequences(X, maxlen=MAX_SEQ_LENGTH)
    print('Shape of feature tensor: {}'.format(X.shape))
    
    # Convert our labels into dummy variable (More info on the link provided above)
    Y = pd.get_dummies(df['Category']).values
    print('Shape of label tensor: {}'.format(Y.shape))
    
    # Split our features and labels into test and train dataset
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=42)
    print(x_train.shape, y_train.shape)
    print(x_test.shape, y_test.shape)
    
    # Creating our model
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Embedding(MAX_NUM_WORDS, EMBEDDING_SIZE, input_length=MAX_SEQ_LENGTH))
    model.add(tf.keras.layers.SpatialDropout1D(0.2))
    # The number 64 could be changed based on your model performance
    model.add(tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2))
    # Our output layer with length similar to the OUTPUT_LENGTH
    model.add(tf.keras.layers.Dense(OUTPUT_LENGTH, activation='softmax'))
    # Compile our model with "categorical_crossentropy" loss function
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    # Model variables
    EPOCHS = 100                          # Number of epochs to run (early stopping may end training sooner)
    BATCH_SIZE = 64                       # Batch size (Tweaking this may improve model performance a bit)
    checkpoint_path = 'model_checkpoints' # Checkpoint path of our model
    
    # Use GPU if available
    with tf.device('/GPU:0'):
      # Fit/Train our model
      history = model.fit(
        x_train, y_train,
        epochs=EPOCHS,
        batch_size=BATCH_SIZE,
        validation_split=0.1,
        callbacks=[
          tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0.0001),
          tf.keras.callbacks.ModelCheckpoint(
            checkpoint_path, 
            monitor='val_acc', 
            save_best_only=True, 
            save_weights_only=False
          )
        ],
        verbose=1
      )
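
    For reference, this is roughly how I'd classify a new article with the trained model; the sample text here is just a placeholder, and the category names are recovered from the column order that pd.get_dummies produced:

    import numpy as np

    # pd.get_dummies sorts its columns, so this gives the index -> category mapping
    labels = pd.get_dummies(df['Category']).columns

    # Tokenize and pad a new (already parsed) article the same way as the training data
    sample_text = ['placeholder parsed article content']  # hypothetical input
    seq = tokenizer.texts_to_sequences(sample_text)
    seq = tf.keras.preprocessing.sequence.pad_sequences(seq, maxlen=MAX_SEQ_LENGTH)

    # The softmax output is one probability per class; argmax picks the most likely class
    pred = model.predict(seq)
    print(labels[np.argmax(pred, axis=1)[0]])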
    

    Now my training accuracy improves with each epoch, but since the validation accuracy (val_acc, around 76~77 percent) is not keeping up, I may need to tweak the model/layers a bit (a quick way to visualize that gap is sketched after the snapshot below).

    The output snapshot is provided below

    Output snapshot.png
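
    To check whether that gap between training and validation accuracy is overfitting, one quick option (assuming matplotlib is installed) is to plot both curves from the history object; depending on the TensorFlow/Keras version the metric keys are 'acc'/'val_acc' or 'accuracy'/'val_accuracy':

    import matplotlib.pyplot as plt

    # Keras stores per-epoch metrics in history.history
    plt.plot(history.history['acc'], label='train accuracy')
    plt.plot(history.history['val_acc'], label='validation accuracy')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()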