I am fine-tuning a BERT model on a labeled dataset in which many documents are longer than BERT's 512-token limit. Since truncating would throw away a lot of data I would rather use, I started looking for a workaround. However, I noticed that one option is never mentioned: simply splitting the documents after 512 tokens (or by some other heuristic) and creating new dataset entries that share the original label.
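To make concrete what I mean by splitting, here is a minimal sketch. The function name and the plain lists of token ids are just illustrative; in practice the ids would come from the BERT tokenizer:

```python
def split_into_entries(token_ids, label, max_len=512):
    """Split one document's token ids into max_len-sized chunks,
    each becoming a new dataset entry with the same label."""
    return [(token_ids[i:i + max_len], label)
            for i in range(0, len(token_ids), max_len)]

# A hypothetical 1100-token document becomes three entries
# of 512, 512, and 76 tokens, all carrying the label "positive".
doc = list(range(1100))
entries = split_into_entries(doc, "positive")
print([len(ids) for ids, label in entries])  # [512, 512, 76]
```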
In this answer, someone mentioned that you would need to recombine the predictions. Is that necessary when splitting the documents?
Is this generally considered bad practice or does it mess with the integrity of the results?
You have not said whether your intention is classification, but given that you link to an article on classification, I will assume an approach where you classify the whole text.
The main question is: which part of the text is the most informative for your purpose? In other words, does it make sense to use more than the first (or last) 512-token split of the text?
When classifying long passages of text, it is frequently enough to consider only the first (or last) 512 tokens to predict the class correctly in a substantial majority of cases (say 90%). Even though you may lose some accuracy, you gain speed and simplicity in the overall solution, and you get rid of the nasty problem of deriving a single class from a set of per-piece classifications. Why?
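Clipping to the first (or last) 512 tokens can be sketched as below, with plain lists of token ids standing in for real tokenizer output. (With the Hugging Face tokenizer, keeping the first 512 tokens is simply passing `truncation=True, max_length=512`.)

```python
def clip_tokens(token_ids, max_len=512, keep="first"):
    """Keep only the first or the last max_len tokens of a document."""
    if keep == "first":
        return token_ids[:max_len]
    return token_ids[-max_len:]

doc = list(range(2100))             # stand-in for a 2100-token document
print(len(clip_tokens(doc)))        # 512
print(clip_tokens(doc, keep="last")[0])  # 1588 (= 2100 - 512)
```

Documents already shorter than `max_len` pass through unchanged, so the same function can be applied uniformly to the whole dataset.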
Consider an example of a text 2100 tokens long. You split it into 512-token pieces, obtaining pieces of 512, 512, 512, 512, and 52 tokens (notice the small last piece: should you even consider it?). Your target class for this text is, say, A, but you get the following per-piece predictions: A, B, A, B, C. So you now have a headache figuring out the right method to determine the class. You could, for example:

- take a majority vote over the pieces (here A and B tie at two votes each),
- weight each vote by piece length, so the 52-token tail counts less (A and B still tie),
- average the model's predicted probabilities across pieces,
- simply trust the prediction of the first piece.
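As an illustration of the headache (the function names are mine, not from any library), here is how two obvious aggregation rules fare on exactly these per-piece predictions:

```python
from collections import Counter

def majority_vote(piece_preds):
    """Most common per-piece label; ties fall back to first occurrence."""
    return Counter(piece_preds).most_common(1)[0][0]

def length_weighted_vote(piece_preds, piece_lens):
    """Weight each piece's vote by its token count, so the tiny
    52-token tail counts less than a full 512-token piece."""
    scores = {}
    for label, n in zip(piece_preds, piece_lens):
        scores[label] = scores.get(label, 0) + n
    return max(scores, key=scores.get)

preds = ["A", "B", "A", "B", "C"]
lens = [512, 512, 512, 512, 52]
# A and B tie 2:2 under majority voting, and 1024:1024 under
# length weighting -- neither rule resolves this example cleanly.
print(majority_vote(preds), length_weighted_vote(preds, lens))
```

Averaging the model's predicted class probabilities across pieces is a third common option, but it brings its own calibration and tie-breaking questions.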
It turns out that there is no easy way. And you will find that text is strange classification material, exhibiting all of the above issues (and more), while the difference in agreement with the annotation between the first-piece prediction and an ultimate, perfect classifier is typically slim at best.
So spare yourself the effort, strive for simplicity and performance, settle for the heuristic... and clip it!
For details on best practices, you should probably refer to the article linked in that answer.