
UIMA Ruta: Make HTMLAnnotator annotate more tags


I'm relatively new to UIMA Ruta and I need to process HTML documents. I already have a ProcessHTML.ruta script which is basically the same as in the documentation (with minor adjustments):

ENGINE utils.HtmlAnnotator;
ENGINE utils.HtmlConverter;
ENGINE HtmlViewWriter;
TYPESYSTEM utils.HtmlTypeSystem;
TYPESYSTEM utils.SourceDocumentInformation;

Document{->CONFIGURE(HtmlAnnotator, "onlyContent"=true), EXEC(HtmlAnnotator, {TAG})};

Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView",
    "outputView" = "plain", "expandOffsets"=false, "replaceLinebreaks"=true, "skipWhitespaces"=true, "linebreakReplacement"=" ", "processAll"=true),
      EXEC(HtmlConverter)};

Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain",
    "outputView" = "_InitialView", "output" = "../converted/"),
    EXEC(HtmlViewWriter)};

I noticed that my next script might need layout information from the HTML source that is currently not present. For example, text is often marked up with <strong> tags, but there is no STRONG annotation in the output. If I understand correctly, all tags not implemented in HTMLTypeSystem are annotated with a default TAG annotation.

Is it possible to define additional annotations for specific HTML tags to be retained? Is there some configuration for this or do I need to extend the annotator somehow?


Solution

  • Adding the following to HTMLTypeSystem.xml did the trick:

    <typeDescription>
        <name>org.apache.uima.ruta.type.html.STRONG</name>
        <description></description>
        <supertypeName>org.apache.uima.ruta.type.html.TAG</supertypeName>
    </typeDescription>
    

    (Kudos to a colleague who figured that one out)
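
    Once the type exists, a follow-up script should be able to match on it like any other HTML annotation. Below is a minimal sketch, assuming the STRONG type above has been added to the type system and HtmlAnnotator has already run; the Emphasized type is a hypothetical name used only for illustration:

    ```ruta
    // Sketch only: assumes STRONG was added to the HTML type system as shown above
    // and that HtmlAnnotator has already been executed on the document.
    TYPESYSTEM utils.HtmlTypeSystem;

    // Hypothetical type for downstream rules; the name is illustrative.
    DECLARE Emphasized;

    // Mark every span covered by a STRONG annotation.
    STRONG{-> MARK(Emphasized)};
    ```

    The same pattern presumably applies to other tags that currently fall back to TAG: add a type named after the tag with TAG as its supertype, then refer to it in rules.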