
Document Storage with Full Text Indexing - PDF

We have built an application for indexing submitted documents in many formats, ranging from Microsoft Office to plain text. The issue is that, for PDF, we often resort to converting to Word and then indexing. This process is slow, and it is especially problematic because it doesn't handle image-based PDFs, where an OCR component would be required.

This question focuses on a solution for providing my users with full-text search over a document library of PDFs. Among comparable solutions, one that also handles Microsoft Office formats is preferred.

Currently, my application uses the J2EE Platform with a MySQL database. I'd be open to switching to a non-relational database if it provided significant benefit.

Solution

I am open to other ideas, but this is the best solution I have been able to find in my research.

I investigated many tools and ended up with a toss-up between Amazon CloudSearch and the Google Drive SDK. Both have strong indexing, tagging, and custom-attribute capabilities, allowing for robust full-text searching.

Unfortunately, Amazon CloudSearch does not provide PDF indexing out of the box (source). There is a workaround: use the experimental command line tool (documented here) to generate SDF from the input file, then submit it via the API. Even with that workaround, I would still have to integrate my own or a third-party OCR tool.
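To give a rough idea of what the "generate SDF, then submit" step involves, here is a minimal sketch that builds a document batch by hand once the text has already been extracted (by an OCR tool or otherwise). The class name, the field names `title` and `content`, and the document ID are all hypothetical. The JSON shape follows the newer batch format as I understand it; older API versions also expect `version` and `lang` on each operation, so check the version your domain uses.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: builds an SDF-style JSON batch from already-extracted text.
public class SdfBatchSketch {

    // Escape a string so it can sit inside a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // One "add" operation; field names are assumptions and must match
    // the index fields configured on the search domain.
    static String addOperation(String id, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"type\":\"add\",\"id\":\"").append(jsonEscape(id)).append("\",\"fields\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(jsonEscape(e.getKey())).append("\":\"")
              .append(jsonEscape(e.getValue())).append('"');
            first = false;
        }
        return sb.append("}}").toString();
    }

    // A batch is a JSON array of operations.
    static String toBatch(List<String> operations) {
        return "[" + String.join(",", operations) + "]";
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Scanned invoice");
        fields.put("content", "Text produced by the OCR step goes here.");
        System.out.println(toBatch(Arrays.asList(addOperation("doc_1", fields))));
    }
}
```

The resulting string would then be posted to the domain's document endpoint. The OCR step itself is still missing here, which is exactly the gap described above.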

The Google Drive SDK/API has one significant downside: it requires each user to have a Google account. (If I shared a single account across users, I would have to download files in order to serve them, since file permissions couldn't easily be worked around via a URI.) Apart from that, this platform meets and exceeds my desired functionality. All one needs to do when uploading is set the OCR parameter to true.
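As a rough illustration of what "setting the OCR parameter" looks like at the HTTP level, the sketch below only builds the upload URI. It assumes the v2 REST upload endpoint and its `ocr` and `ocrLanguage` query parameters; the class and method names are hypothetical, and authentication and the multipart request body are left out entirely. Newer API versions handle OCR differently, so treat this as a sketch rather than a reference.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the upload URI with OCR switched on.
public class DriveOcrUploadSketch {

    // Assumed v2 upload endpoint; verify against the API version you target.
    static final String UPLOAD_ENDPOINT = "https://www.googleapis.com/upload/drive/v2/files";

    static URI buildUploadUri(boolean ocr, String ocrLanguage) {
        // URLEncoder.encode(String, Charset) requires Java 10 or later.
        String query = "uploadType=multipart"
                + "&ocr=" + ocr
                + "&ocrLanguage=" + URLEncoder.encode(ocrLanguage, StandardCharsets.UTF_8);
        return URI.create(UPLOAD_ENDPOINT + "?" + query);
    }

    public static void main(String[] args) {
        // The PDF bytes and metadata would be sent as the body of a POST to this URI.
        System.out.println(buildUploadUri(true, "en"));
    }
}
```

In practice the official Java client library wraps this, so the flag would be set on the insert request object instead of being assembled by hand.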