Tags: python, nlp, huggingface-transformers, text-classification, huggingface-tokenizers

huggingface longformer case sensitive tokenizer


This page shows how to build a Longformer-based classification model.

import pandas as pd
import datasets
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig
import torch.nn as nn
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
import wandb
import os


# load model and tokenizer and define length of the text sequence
model = LongformerForSequenceClassification.from_pretrained(
    'allenai/longformer-base-4096',
    gradient_checkpointing=False,
    attention_window=512,
)
tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', max_length=1024)

I noticed that the tokenizer is sensitive to the case of the data: the words "Do" and "do" get different token ids below. I don't need this behavior. I can always lowercase my data before feeding it to Longformer, but is there a better way to tell the tokenizer to ignore the case of the data?

encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)
{'input_ids': [0, 8275, 45, 31510, 459, 11, 5, 5185, 9, 44859, 6, 13, 51, 32, 12405, 8, 2119, 7, 6378, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

encoded_input3 = tokenizer("do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input3)
{'input_ids': [0, 5016, 45, 31510, 459, 11, 5, 5185, 9, 44859, 6, 13, 51, 32, 12405, 8, 2119, 7, 6378, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Solution

  • In my opinion, it is better not to modify the tokenization scheme of a pretrained transformer: the model was pretrained with a certain vocabulary, and changing it may degrade its performance.

    If you still want to make the system insensitive to capitalization, and if you want to use the same model and the same tokenizer without modifying them too much, then just lowercasing the input data is a sensible strategy.

    However, if you are willing to spend resources on updating the model and the tokenizer, you can do the following:

    1. Modify the tokenizer: add a Lowercase normalizer to its normalization pipeline (see the first sketch below).
    2. (Optionally) modify the vocabulary of the tokenizer: drop unused words that contain uppercase characters, and perhaps add some lowercase words. The embedding and output layers of the model must then be modified accordingly, and there is no standardized code for this, so such changes are not recommended unless you know exactly what you are doing. Done carefully, though, they can improve model performance.
    3. Fine-tune the model (with the original masked language modelling objective) on a large dataset using the updated tokenizer, so that the network adapts to the all-lowercase text it will now see (see the second sketch below).

    This will make the system better adapted to uncased texts, but it will cost you time (to write the code for updating the vocabulary) and computational resources (to fine-tune the model).
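For step 1, here is a minimal sketch of attaching a Lowercase normalizer to the fast tokenizer via the tokenizers library (the backend_tokenizer.normalizer assignment assumes a recent tokenizers version; the quick alternative remains simply calling tokenizer(text.lower())):

from tokenizers import normalizers
from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', max_length=1024)

# Longformer's byte-level BPE tokenizer normally ships without a normalizer,
# so we can attach a Lowercase step directly; if one were already present we
# would wrap both in a normalizers.Sequence.
existing = tokenizer.backend_tokenizer.normalizer
if existing is None:
    tokenizer.backend_tokenizer.normalizer = normalizers.Lowercase()
else:
    tokenizer.backend_tokenizer.normalizer = normalizers.Sequence([normalizers.Lowercase(), existing])

# "Do" and "do" now produce identical input_ids.
print(tokenizer("Do not meddle in the affairs of wizards.")["input_ids"])
print(tokenizer("do not meddle in the affairs of wizards.")["input_ids"])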
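For step 3, a rough sketch of continued masked-language-model pretraining with the updated tokenizer (corpus.txt, the batch size, and the single epoch are placeholders; a real run needs a large corpus and substantial compute):

from datasets import load_dataset
from transformers import (LongformerForMaskedLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Placeholder corpus: any large plain-text dataset you have available.
raw = load_dataset('text', data_files={'train': 'corpus.txt'})

def tokenize(batch):
    # The lowercasing tokenizer from the previous sketch.
    return tokenizer(batch['text'], truncation=True, max_length=1024)

tokenized = raw.map(tokenize, batched=True, remove_columns=['text'])

mlm_model = LongformerForMaskedLM.from_pretrained('allenai/longformer-base-4096')
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir='longformer-uncased-mlm',
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
)

Trainer(model=mlm_model, args=training_args,
        train_dataset=tokenized['train'], data_collator=collator).train()

After this continued pretraining, the saved encoder weights can be loaded back into LongformerForSequenceClassification for the classification fine-tuning shown at the top of the page.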