This page shows how to build a longformer based classification.
import pandas as pd
import datasets
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig
import torch.nn as nn
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
import wandb
import os
# load model and tokenizer and define length of the text sequence
model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096',
gradient_checkpointing=False,
attention_window = 512)
tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', max_length = 1024)
I noticed that the tokenizer is sensitive to case of the data. words do
and Do
get different tokens below. I dont need such behavior. I can always lowercase my data before feeding to longformer. But is there is any other better way to tell tokenizer to ignore case of the data?
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)
{'input_ids': [0, 8275, 45, 31510, 459, 11, 5, 5185, 9, 44859, 6, 13, 51, 32, 12405, 8, 2119, 7, 6378, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
encoded_input3 = tokenizer("do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input3)
{'input_ids': [0, 5016, 45, 31510, 459, 11, 5, 5185, 9, 44859, 6, 13, 51, 32, 12405, 8, 2119, 7, 6378, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
In my opinion, it is better not to modify tokenization schemes of pretrained transformers: they were pretrained with a certain vocabulary, and changing it may lead to their deterioration.
If you still want to make the system insensitive to capitalization, and if you want to use the same model and the same tokenizer without modifying them too much, then just lowercasing the input data is a sensible strategy.
However, if you are willing to spend resources on updating the model and the tokenizer, you can do the following:
This will make the system better adapted to uncased texts, but it will cost you time (to write the code for updating the vocabulary) and computational resources (to fine-tune the model).