
Hugging Face Transformers BERT Tokenizer - Find out which documents get truncated


I am using the Transformers library from Hugging Face to create a text classification model based on BERT. For this I tokenise my documents with truncation set to true, as my documents are longer than the allowed maximum (512).

How can I find out how many documents actually get truncated? I don't think the limit of 512 refers to the character or word count of a document, since the tokenizer prepares the document as input for the model. What happens to the document, and is there a straightforward way to check whether or not it gets truncated?

This is the code I use to tokenise the documents.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
# Auto class resolves to DistilBERT; BertForSequenceClassification would not match this checkpoint.
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-multilingual-cased", num_labels=7)
train_encoded = tokenizer(X_train, padding=True, truncation=True, return_tensors="pt")
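
For reference, the tokenizer returns a dict-like object of tensors, which is what the model actually receives (a minimal sketch, assuming X_train is a list of strings):

print(train_encoded.keys())              # dict_keys(['input_ids', 'attention_mask'])
print(train_encoded["input_ids"].shape)  # (num_documents, padded_sequence_length)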

In case you have any more questions about my code or problem, feel free to ask.


Solution

  • Your assumption is correct!

    Anything longer than 512 tokens (assuming you are using "distilbert-base-multilingual-cased") is truncated when truncation=True.

    A quick check is to tokenise without truncation and count the examples that exceed the model's maximum input length:

    
    # Tokenise without truncation so over-length documents keep their full token count.
    train_encoded_no_trunc = tokenizer(X_train, padding=True, truncation=False, return_tensors="pt")

    count = 0
    for doc in train_encoded_no_trunc.input_ids:
        # Count the real (non-padding) tokens; for this tokenizer the pad token id is 0.
        if (doc != tokenizer.pad_token_id).sum() > tokenizer.model_max_length:
            count += 1
    print("number of truncated docs:", count)