In the Tesseract wiki, the filename format for labeled tif/box files used in training is given as [lang].[fontname].exp[num]. Does fontname actually impact training, or is it just for bookkeeping?
In my particular case, I have a large number of document images with different fonts (and I don't know which fonts are in them). Can I just use eng.idontknow.exp[num] for each document I label manually, or will this mess up training for some reason? Thanks in advance!
It's best to match a real font (to help possible post-OCR analyses), but it can be some arbitrary font name.
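If it helps, here is a minimal sketch of generating matching tif/box names in the [lang].[fontname].exp[num] format for a batch of manually labeled documents. The helper name training_pair_names is hypothetical, and "idontknow" is just the placeholder font name from the question; the code only builds filename strings and does not call Tesseract.

```python
def training_pair_names(lang: str, fontname: str, num: int) -> tuple[str, str]:
    """Build the image and box filenames for one training sample.

    Follows the [lang].[fontname].exp[num] pattern; fontname may be a
    placeholder when the real font is unknown.
    """
    base = f"{lang}.{fontname}.exp{num}"
    return f"{base}.tif", f"{base}.box"


# One pair per labeled document, numbered from 0 (the starting index is an assumption).
pairs = [training_pair_names("eng", "idontknow", n) for n in range(3)]
for image_name, box_name in pairs:
    print(image_name, box_name)
```

Keeping the image and box names identical apart from the extension is what ties each box file to its image, so only the exp number needs to change from document to document.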