I'm using alfresco-simple-ocr with pdfsandwich and Tesseract OCR. I want to get the text from a document added to a folder and then use that text and the PDF file in a new workflow.
I've managed to get OCR extraction working and to start a workflow when a file is added to the folder,
but I can't get the text from the file and use it in the workflow.
Is it possible to do this?
Where should I start implementing it?
Greetings, Rafał
You don't need any extension for that. Alfresco already integrates PDFBox, which will extract the text for you. Beyond that, it depends on your PDF: whether it contains only images (a scanned document) or already contains text. If you want to OCR images, there is also this module: https://github.com/bchevallereau/alfresco-tesseract
Once you know what you want to transform, have a look at this page, which has a JavaScript sample showing how to call transformers: http://docs.alfresco.com/5.2/references/dev-extension-points-content-transformer.html You can do the same in Java if you need to.
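
To make the "text PDF vs. scanned PDF" decision concrete, here is a minimal Java sketch of the flow. It is not Alfresco API: the two extractor functions are hypothetical stand-ins for whatever you end up calling (a PDF-to-text transformer for the text layer, an OCR transformer for scans), and the property names `sketch:document` and `sketch:extractedText` are made up for illustration. In a real extension you would presumably pass a map like this as the parameters when starting the workflow.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch only: the extractors and property names are hypothetical stand-ins,
// not part of the Alfresco API.
final class OcrWorkflowSketch {

    // Hypothetical workflow variable names; a real workflow model defines its own.
    static final String PROP_DOCUMENT = "sketch:document";
    static final String PROP_TEXT = "sketch:extractedText";

    /**
     * Returns the PDF's text layer if it has one, otherwise the OCR result.
     * Both extractors are supplied by the caller.
     */
    static String extractText(byte[] pdf,
                              Function<byte[], String> textLayer,
                              Function<byte[], String> ocr) {
        String text = textLayer.apply(pdf);
        if (text == null || text.trim().isEmpty()) {
            // No usable text layer, so treat the file as a scanned document.
            text = ocr.apply(pdf);
        }
        return text == null ? "" : text.trim();
    }

    /** Bundles the document reference and its text as workflow start parameters. */
    static Map<String, Object> workflowParameters(String documentRef, String text) {
        Map<String, Object> params = new HashMap<>();
        params.put(PROP_DOCUMENT, documentRef);
        params.put(PROP_TEXT, text);
        return params;
    }
}
```

You would call `extractText(pdfBytes, textExtractor, ocrExtractor)` from the rule or behaviour that fires when the file lands in the folder, then hand the result of `workflowParameters(...)` to the code that starts the workflow. Trying the text layer first avoids running OCR on PDFs that already contain text.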