
FineReader Engine Java SDK. How to ignore pictures during conversion from PDF to DOCX


I need a way to skip the pictures and photos in a PDF document when converting it to a DOCX file.

I am creating an instance of FineReader Engine:

IEngine engine = Engine.InitializeEngine(
engineConfig.getDllFolder(), engineConfig.getCustomerProjectId(),
engineConfig.getLicensePath(), engineConfig.getLicensePassword(), "", "", false);

After that, I am converting a document:

IFRDocument document = engine.CreateFRDocument();
document.AddImageFile(file.getAbsolutePath(), null, null);
// null = process with the default parameters
document.Process(null);
String exportPath = FileUtil.prepareExportPath(file, resultFolder);
// null = export with the default DOCX export parameters
document.Export(exportPath, FileExportFormatEnum.FEF_DOCX, null);

As a result, all images from the original PDF document end up in the DOCX output.


Solution

  • I'm not really familiar with PDF to DOCX conversion, but I think you could try a custom profile tailored to your needs.

    At some point in your code you create an Engine object and then a Document object (or an IFRDocument object, depending on your application). Add this line just before handing your document to the engine for processing:

    engine.LoadProfile(PROFILE_FILENAME);
    

    Then create the profile file, using the processing parameters described in the "Working with Profiles" section of the documentation packaged with your FRE installation. Make sure the file contains:

    ... some params under other sections
    
    [PageAnalysisParams]
    DetectText = TRUE       --> force text detection
    DetectPictures = FALSE  --> ignore pictures
    ... other params under PageAnalysisParams
    
    ... some params under other sections
    

    It works the same way for barcodes, etc. Keep in mind that you should benchmark your results whenever you add or remove settings in this file, because they can change both the processing speed and the overall quality of the output.
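
If you would rather generate the profile from Java than ship it as a separate file, something along these lines might work. It is only a sketch: the class name PictureFreeProfile, the ".ini" extension and the UTF-8 encoding are my own assumptions, and the file contains nothing beyond the two keys quoted above. Check the "Working with Profiles" section for the format your FRE version actually expects.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of the FineReader Engine API). */
final class PictureFreeProfile {

    private PictureFreeProfile() {}

    /** Builds the profile text: text detection on, picture detection off. */
    static String render() {
        return String.join(System.lineSeparator(),
                "[PageAnalysisParams]",
                "DetectText = TRUE",
                "DetectPictures = FALSE",
                "");
    }

    /** Writes the profile to a temporary file and returns its path. */
    static Path writeToTempFile() {
        try {
            // File extension and encoding are assumptions; adjust to your setup.
            Path profile = Files.createTempFile("no-pictures", ".ini");
            Files.write(profile, render().getBytes(StandardCharsets.UTF_8));
            return profile;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path profile = writeToTempFile();
        System.out.println("Profile written to " + profile);
        // With the engine from the question available, the next step would be:
        // engine.LoadProfile(profile.toString());
    }
}
```

The returned path would then be passed to engine.LoadProfile(...) before document.Process(null) is called, exactly as with a hand-written profile file.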