I am using Apache Tika on Windows 10 with JRE 1.8.0_181, and I've imported Tika through Maven with the following dependencies:
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.21</version>
        </dependency>
    </dependencies>
I have the code below for performing OCR using Tesseract (which I have independently tested and know to be working):
    public static void OCRTest() {
        try {
            BufferedImage im = ImageIO.read(new File(OCR_IMAGE));
            TesseractOCRConfig config = new TesseractOCRConfig();
            config.setTessdataPath("C:\\Program Files\\Tesseract-OCR\\tessdata");
            config.setTesseractPath("C:\\Program Files\\Tesseract-OCR");
            ParseContext parseContext = new ParseContext();
            parseContext.set(TesseractOCRConfig.class, config);
            TesseractOCRParser parser = new TesseractOCRParser();
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            try {
                parser.parse(im, handler, metadata, parseContext);
                System.out.println(handler.toString());
            } catch (SAXException e) {
                e.printStackTrace();
            } catch (TikaException e) {
                e.printStackTrace();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
I run into the following exception:
    org.apache.tika.exception.TikaException: Failed to close temporary resources
        at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:174)
        at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:251)
        at test.test.App.OCRTest(App.java:46)
        at test.test.App.main(App.java:30)
    Caused by: java.nio.file.FileSystemException: C:\Users\m\AppData\Local\Temp\apache-tika-2643805894084124300.tmp: The process cannot access the file because it is being used by another process.
The .tmp file is still present in the Temp folder, and the exception seems to come from Tika not being able to delete it. On the Apache Tika forums, there is a post where someone else ran into the same exception, although with the AutoDetectParser rather than the TesseractOCRParser. Their issue seemed to be a conflict between their imported jars, but I run into this issue even with only the Apache Tika libraries installed.
I don't run into this issue when using Tika's AutoDetectParser, only with the TesseractOCRParser. Any insights on how to fix the exception would be appreciated!
I reported this on the Apache Tika issue tracker (https://issues.apache.org/jira/browse/TIKA-2908). The issue came from the order in which the TesseractOCRParser was closing its open streams. You can see the changes made here: https://github.com/apache/tika/commit/8d386f827eb31e7f1cb189ce942c67a84a0c6bdc?diff=unified#diff-592f390e7558bb6a1fe1c5bc810fe4c8
For now, anyone who runs into this issue can subclass TesseractOCRParser locally to include the above changes, which should be pushed in the next snapshot release.
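To illustrate why the closing order matters: on Windows, a file generally cannot be deleted while a stream on it is still open, which matches the "being used by another process" message above. The snippet below is only a minimal sketch of that ordering using plain JDK calls. It is not the Tika fix itself, and the class and method names (TempFileCloseOrder, writeThenDelete) are made up for the example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of Tika.
final class TempFileCloseOrder {

    // Writes data to a temp file, closes the stream, and only then deletes
    // the file. Returns true if the file no longer exists afterwards.
    static boolean writeThenDelete(byte[] data) {
        try {
            Path tmp = Files.createTempFile("close-order-demo-", ".tmp");
            try (OutputStream out = Files.newOutputStream(tmp)) {
                out.write(data);
            } // the stream is closed here, releasing the handle on the file
            // Deleting before the close above is the ordering that can fail
            // on Windows with a FileSystemException.
            Files.delete(tmp);
            return !Files.exists(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling TempFileCloseOrder.writeThenDelete(someBytes) should return true on any platform, because the stream is closed before the delete is attempted.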
Thanks to Tim @ Apache Tika!