I'm using Ephesoft Community Edition 4.0.2.0 with TIF images (tested by Ephesoft). The problem is that Ephesoft can classify and extract data from some images but not from others, and there is no error message in the log files. I don't know why.
When I click Learn Files, the generated HOCR and HTML files are empty, containing only metadata and no data, like this:
Application_Checklist_HOCR.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<HocrPages><HocrPage>
<Title></Title><Spans/>
<HocrContent></HocrContent>
</HocrPage></HocrPages>
But for US-invoice_HOCR.xml, Ephesoft can learn, and the file looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><HocrPages><HocrPage>
<Title></Title><Spans><Span><Value>INVOICE</Value><Coordinates><x0>579</x0>
<y0>247</y0><x1>881</x1><y1>304</y1></Coordinates></Span><Span>
<Value>ACME</Value><Coordinates><x0>168</x0><y0>394</y0><x1>311</x1><y1>431</y1>
</Coordinates></Span><Span><Value>Company</Value><Coordinates><x0>329</x0>
<y0>395</y0><x1>541</x1><y1>442</y1></Coordinates></Span><Span>
<Value>lnvoice</Value><Coordinates>............
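You can modify the Tesseract config file at /Path-To-Ephesoft/Application/WEB-INF/classes/META-INF/dcma-tesseract/tesseract-reader.properties and comment out this line so that it reads #tesseract.command_parameters=-psm 4, which lets Tesseract use its default page segmentation. With -psm 4, Tesseract assumes a single column of text, which can produce no output on pages with a different layout.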
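To check whether the change helped, you could count the recognized words in each generated HOCR file after running Learn Files again. The following is only a minimal sketch in Python, assuming the files have the structure shown in the question (`Span` elements with a `Value` child and no XML namespace); the function names `count_recognized_spans` and `count_in_file` are hypothetical helpers, not part of Ephesoft.

```python
import xml.etree.ElementTree as ET


def count_recognized_spans(hocr_xml):
    """Count <Span> elements that carry a non-empty <Value>.

    Assumes the un-namespaced layout shown in the question:
    HocrPages > HocrPage > Spans > Span > Value.
    """
    root = ET.fromstring(hocr_xml)
    return sum(
        1
        for span in root.iter("Span")
        if (span.findtext("Value") or "").strip()
    )


def count_in_file(path):
    """Read an HOCR XML file as bytes and count its recognized spans."""
    with open(path, "rb") as handle:
        return count_recognized_spans(handle.read())
```

A result of 0 for a file such as Application_Checklist_HOCR.xml would mean Tesseract still returned no text for that page, while a non-zero count would mean OCR produced words that Ephesoft can learn from.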