Difficulty using JAPE Grammar


I have a document that contains sections such as Assessments, HPI, ROS, Vitals, etc., and I want to extract the notes in each section. I am using GATE for this and have written a JAPE file that extracts the notes in the Assessment section. The grammar is as follows:

Input: Token
Options: control=appelt debug=true

Rule: Assess
({Token.string =~"(?i)diagnose[d]?"}{Token.string=="with"} | {Token.string=~"(?i)suffering"}{Token.string=~"(?i)from"} | {Token.string=~"(?i)suffering"}{Token.string=~"(?i)with"})

(
({Token})*
):assessments

({Token.string =~"(?i)HPI"} | {Token.string =~"(?i)ROS"} | {Token.string =~"(?i)EXAM"} | {Token.string =~"(?i)VITAL[S]"} | {Token.string =~"(?i)TREATMENT[s]"} |{Token.string=~"(?i)use[d]?"}{Token.string=~"(?i)orderset[s]?"} | {Token.string=~"$"})


-->
:assessments.Assessments = {}

When the assessment section is at the end of the document, the notes are retrieved correctly. But if it lies between two other sections, the rule returns everything from the assessment section to the end of the file.

I have tried using {Token.string=~"$"} in different ways, but I could not extract only the assessment section irrespective of its position in the document.

Please explain how I can achieve this using a JAPE grammar.


Solution

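  • This is the expected behaviour: Appelt mode always prefers the longest possible overall match. The =~ operator succeeds if the pattern is found anywhere in the string, and $ matches at the end of every string, so any Token satisfies string =~ "$". The assessments label therefore grabs everything up to, but not including, the final token in the document.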
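
    As a small illustration of the two regular-expression operators, compare the constraint from the question with a whole-string match. These are constraint fragments rather than a complete grammar, and the HPI pattern is only an example:

    ```
    // =~ is a "find": the constraint holds if the pattern occurs anywhere in the string.
    // "$" matches at the end of any string, so every Token satisfies this.
    {Token.string =~ "$"}

    // ==~ must match the entire string, so only the token HPI (in any case) satisfies this.
    {Token.string ==~ "(?i)hpi"}
    ```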

    I would adopt a two-pass approach: an initial gazetteer or JAPE phase to annotate the section headings, followed by a second phase that has only these Heading annotations in its Input line:

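    Imports: { import static gate.Utils.*; }
    Phase: AnnotateBetweenHeadings
    Input: Heading
    Options: control = appelt

    Rule: TwoHeadings
    // h1 is the assessments heading; h2 is the next heading, if there is one
    ({Heading.type == "assessments"}):h1
    (({Heading})?):h2
    -->
    {
      // default to the end of the document when no heading follows
      Long endOffset = end(doc);
      AnnotationSet h2Annots = bindings.get("h2");
      if(h2Annots != null && !h2Annots.isEmpty()) {
        endOffset = start(h2Annots);
      }
      try {
        outputAS.add(end(bindings.get("h1")), endOffset, "Assessments", featureMap());
      } catch(InvalidOffsetException e) {
        // add(Long, Long, ...) declares this checked exception
        throw new GateRuntimeException(e);
      }
    }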
    

    This will annotate everything between the end of the assessments heading and the start of the following heading, or the end of the document if there is no following heading.
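
    For the first pass, a minimal sketch of a JAPE phase that creates the Heading annotations could look like the following. The trigger phrases are taken from the grammar in the question. The phase name, the rule names, and the "other" value for the type feature are assumptions made for illustration, and the optional S in the VITALS and TREATMENTS patterns is a guess at the intent. The sketch uses ==~, which must match the whole token string, so that a pattern such as used? cannot fire on a token like "because".

    ```
    Phase: AnnotateHeadings
    Input: Token
    Options: control = appelt

    // Phrases that introduce the assessment notes
    Rule: AssessmentsHeading
    (
      ({Token.string ==~ "(?i)diagnosed?"} {Token.string ==~ "(?i)with"}) |
      ({Token.string ==~ "(?i)suffering"} {Token.string ==~ "(?i)from|with"})
    ):heading
    -->
    :heading.Heading = {type = "assessments"}

    // Headings of the other sections, which terminate the assessment notes
    Rule: OtherHeading
    (
      {Token.string ==~ "(?i)hpi|ros|exam|vitals?|treatments?"} |
      ({Token.string ==~ "(?i)used?"} {Token.string ==~ "(?i)ordersets?"})
    ):heading
    -->
    :heading.Heading = {type = "other"}
    ```

    The two phases have to run in this order. One way to arrange that is a main grammar file that lists them, assuming they are saved as AnnotateHeadings.jape and AnnotateBetweenHeadings.jape (hypothetical file names) next to it:

    ```
    MultiPhase: SectionExtraction
    Phases:
    AnnotateHeadings
    AnnotateBetweenHeadings
    ```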