How to create a Lucene index when the documents include scanned images, among other things?

My database stores resumes in a BLOB field. The resumes may be Microsoft Word documents, PDFs, or images (.jpg etc.). How can we create a Lucene index from these disparate file types, especially the .jpg files? Can Tika understand scanned images?

Solution

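Tika can extract text from Word and PDF files directly, and that text can then be added to a Lucene index. When extracting from images, it is also possible to chain in Tesseract, via the TesseractOCRParser, to have OCR performed on the contents of the image.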
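
A minimal sketch of how this could fit together, assuming tika-parsers and lucene-core are on the classpath and the tesseract binary is installed and on the PATH (in that setup Tika's AutoDetectParser should pick up the TesseractOCRParser for images on its own). The class name ResumeIndexer, the field names "id" and "content", and the index directory "resume-index" are made up for illustration. Reading the blob from your database is left out, so main reads a file from disk as a stand-in.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class; the names here are illustrative, not part of Tika or Lucene.
public class ResumeIndexer {

    /** Extracts plain text from a blob; Tika detects the format (Word, PDF, image, ...). */
    static String extractText(byte[] blob) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        // Registering the parser in the context lets embedded documents be parsed too.
        context.set(Parser.class, parser);
        try (InputStream in = new ByteArrayInputStream(blob)) {
            parser.parse(in, handler, metadata, context);
        }
        return handler.toString();
    }

    /** Builds a Lucene document: the id is stored verbatim, the text is analyzed for search. */
    static Document toDocument(String resumeId, String text) {
        Document doc = new Document();
        doc.add(new StringField("id", resumeId, Field.Store.YES));
        doc.add(new TextField("content", text, Field.Store.NO));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the database blob: read the bytes of a file passed on the command line.
        byte[] blob = Files.readAllBytes(Paths.get(args[0]));
        try (Directory dir = FSDirectory.open(Paths.get("resume-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addDocument(toDocument(args[0], extractText(blob)));
        }
    }
}
```

If tesseract is not installed, the same code should still index Word and PDF resumes, but image files will most likely yield only metadata and no searchable text.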

See the Apache Tika documentation on image formats: https://tika.apache.org/1.20/formats.html#Image_formats