
UIMA Ruta input type - html


I have PDF and Word files that need to be used as input for Ruta. I can convert them into text files, but I lose all the tables and formatting if I do that. Is there any way I can use them without losing any information?

Thanks!


Solution

  • You need an additional program that can convert PDF (or DOC/DOCX) to HTML. There are two main kinds of PDF converter: those that use absolute positioning to generate nice-looking HTML, and those that rely only on HTML elements and CSS. For processing tables, I recommend the latter. I personally use a commercial solution, but there is also a lot of good open-source software, e.g., pdf2htmlEX.

    Once you have HTML, you can apply the HtmlAnnotator and HtmlConverter to obtain plain text with annotations for the HTML tags, as described in the UIMA Ruta documentation.
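
As a rough illustration of the conversion step, here is a minimal Python sketch that batch-converts PDFs to HTML by calling the pdf2htmlEX command-line tool. It assumes pdf2htmlEX is installed and on your PATH, and that it accepts a `--dest-dir` option followed by the input PDF and an output file name. The helper names `build_command` and `convert_all` and the folder names are hypothetical. It covers only the PDF-to-HTML step, not the Ruta side.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf_path: Path, out_dir: Path) -> List[str]:
    """Build the pdf2htmlEX call: options, then the input PDF, then the output file name.

    Assumption: pdf2htmlEX takes `--dest-dir <dir>` and positional
    `<input.pdf> [<output.html>]` arguments.
    """
    return [
        "pdf2htmlEX",
        "--dest-dir", str(out_dir),
        str(pdf_path),
        pdf_path.stem + ".html",
    ]


def convert_all(src_dir: Path, out_dir: Path) -> List[Path]:
    """Convert every *.pdf in src_dir and return the expected HTML paths."""
    if shutil.which("pdf2htmlEX") is None:
        raise RuntimeError("pdf2htmlEX was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(src_dir.glob("*.pdf")):
        # check=True raises CalledProcessError if the converter exits with an error
        subprocess.run(build_command(pdf_path, out_dir), check=True)
        written.append(out_dir / (pdf_path.stem + ".html"))
    return written
```

Calling `convert_all(Path("input_pdfs"), Path("html_out"))` would then leave one HTML file per PDF in `html_out`, and those files can be fed to Ruta with the HtmlAnnotator and HtmlConverter. Word files would need a different converter.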