I must be missing something ...
I want to use a pretrained model with HuggingFace:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

transformer_name = "Geotrend/distilbert-base-fr-cased"  # Or whatever model
model = AutoModelForSequenceClassification.from_pretrained(transformer_name, num_labels=5)
tokenizer = AutoTokenizer.from_pretrained(transformer_name)
```
Now that I have my model and my tokenizer, I need to tokenize my dataset, but I don't know which parameters (`padding`, `truncation`, `max_length`) to use with my tokenizer.
Some examples just call the tokenizer with `tokenizer(data)`, others use truncation only with `tokenizer(data, truncation=True)`, and others pass many parameters with `tokenizer(data, padding=True, truncation=True, return_tensors='pt', max_length=512)`.
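For context, here is a minimal sketch of what each call style produces, assuming the tokenizer loaded above (the sample sentences are made up):

```python
batch = ["Bonjour tout le monde", "Bonjour"]

plain = tokenizer(batch)                      # lists of ids, no padding, no truncation
trunc = tokenizer(batch, truncation=True)     # truncated to the model's max length if needed
padded = tokenizer(batch, padding=True,       # padded to the longest sequence in the batch,
                   truncation=True,           # truncated at max_length,
                   return_tensors="pt",       # and returned as PyTorch tensors
                   max_length=512)

print([len(ids) for ids in plain["input_ids"]])   # unequal, untouched lengths
print([len(ids) for ids in trunc["input_ids"]])   # capped at the model's max length
print(padded["input_ids"].shape)                  # one rectangular PyTorch tensor
```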
As I am reloading a pretrained tokenizer, I would have loved for it to use the same parameters as in the original training process. How do I know which parameters to use?
My understanding is that I always need to truncate my data and leave `max_length` at `None`, so that my sequence lengths will always be lower than the model's maximum length. Is that it? Does leaving `max_length` at `None` make it fall back on the model's maximum length?
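From what I can tell, that fallback can be checked directly on the tokenizer; a minimal sketch, assuming the tokenizer loaded above:

```python
# With truncation=True and max_length=None, the tokenizer falls back on the
# model's maximum input length, exposed as tokenizer.model_max_length.
print(tokenizer.model_max_length)  # e.g. 512 for DistilBERT checkpoints

long_text = "mot " * 10_000  # deliberately longer than the model can accept
encoded = tokenizer(long_text, truncation=True)  # max_length left at None
assert len(encoded["input_ids"]) <= tokenizer.model_max_length
```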
And what should I do with `padding`? As I am using a `Trainer` object for training with a `DataCollatorWithPadding`, should I set `padding` to `False` to reduce the memory impact and let the collator pad my batches?
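Here is a sketch of that setup as I understand it; `raw_dataset` and its `text` column are hypothetical placeholders for my actual data:

```python
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

def tokenize_fn(examples):
    # No padding here: each example keeps its own length;
    # only truncation to the model's maximum length is applied.
    return tokenizer(examples["text"], truncation=True)

tokenized_dataset = raw_dataset.map(tokenize_fn, batched=True)

# The collator pads each batch to the longest sequence in that batch.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
```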
Final question: what should I do if I use a `TextClassificationPipeline` for inference? Should I specify these parameters (`padding`, etc.)? Will the pipeline handle it for me?
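For reference, this is roughly what I have in mind for inference; whether the extra tokenizer arguments are needed (or even forwarded to the tokenizer) may depend on the transformers version, so treat this as an assumption to verify:

```python
from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)

# Recent versions forward extra keyword arguments to the tokenizer,
# so truncation can be requested per call (assumption, not confirmed):
print(pipe("Un avis client à classer", truncation=True))
```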
Answer: if `truncation=True` and `max_length` is `None`, then the maximum acceptable input length for the model is considered (see docs). And yes, tokenize without padding and let `DataCollatorWithPadding` pad each batch dynamically during training. More about it in this video.
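A hedged demo of that dynamic padding, reusing the tokenizer loaded above (the sample sentences are made up): the collator pads only to the longest sequence in the batch it receives, not to a global `max_length`.

```python
from transformers import DataCollatorWithPadding

collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Two unpadded examples of different lengths.
features = [
    tokenizer("Bonjour"),
    tokenizer("Bonjour tout le monde, comment allez-vous ?"),
]

batch = collator(features)
print(batch["input_ids"].shape)  # both rows padded to the longer sequence's length
```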