Creating new annotation sets in GATE


I have started learning GATE and would like to use it to extract information from unstructured documents. The information I am interested in is dates, locations, events and people's names: I want to find out which events happened at a specific location on a specific date, and who was involved. I have been reading the GATE manual, which gave me a rough idea of how to build a pipeline. However, I cannot figure out how to create my own annotation types and make sure they are written to a new annotation set, which should appear in the list of annotation sets on the right. I found similar questions (for example GATE - How to create a new annotation SET?), but they didn't help me either.

Let me explain what I did so far:

  1. Created a .lst file for each new NE type and put it under the ANNIE resources/gazetteer directory
  2. Added the .lst file descriptions to the lists.def file
  3. Identified my patterns in the document, e.g. date formats such as ddmm and dd.mm.yyyy
  4. Wrote a JAPE rule for each pattern in a separate .jape file
  5. Added the JAPE file names to the main.jape file
  6. Loaded the PRs and my document into GATE
  7. Ran the application

This is what my JAPE rule looks like for one date format:

    Phase: datesearching
    Input: Token Lookup SpaceToken
    Options: control = appelt

    ////////////////////////////////////Macros
    //Initialization of regular expressions
    Macro: DAY_ONE
    ({Token.kind == number,Token.category==CD, Token.length == "1"})

    Macro: C
    ({Token.kind == number,Token.category==CD, Token.length == "2"})

    Macro: YEAR
    ({Token.kind == number,Token.category==CD, Token.length == "4"})

    Macro: MONTH
    ({Lookup.majorType=="Month"})

    Rule: ddmmyyydash
    (
        (DAY_ONE|DAY_TWO)
        ({Token.string == ","}|{Token.string == "."} |{Token.string == "-"})
        (MONTH)
        ({Token.string == ","}|{Token.string == "."} |{Token.string == "-"})
        (YEAR)
    )
    :ddmmyyyydash
    -->
        :ddmmyyyydash.DateMonthYearDash= {rule = "ddmmyyyydash"}

Can someone please help me with what I should do to make sure that DateMonthYearDash is created as a new annotation set? How do I do it? Thanks a lot.

When I change the outputASName of the JAPE Transducer, the new set does not appear like the others. This is how it looks:

[Screenshot: annotation set list]


Solution

  • As explained in the question you linked (GATE - How to create a new annotation SET?), you have two options:

    1. Change the outputASName of your JAPE transducer PR.
    2. Use the Annotation Set Transfer PR to copy or move the desired annotations from one annotation set to another.

    JAPE function - explanation

    A JAPE transducer, like many other GATE PRs, takes some input annotations and creates new output annotations based on them. The names of the input and output annotation sets are configured with the inputASName and outputASName run-time parameters: inputASName says where the transducer looks for input annotations, and outputASName says where it puts the output annotations.

    What should be where

    The input annotation set must contain the necessary input annotations before the JAPE transducer PR is executed. These annotations are usually created by preceding PRs in the pipeline. Otherwise the transducer sees no input annotations and produces nothing.

    The output annotation set may be empty or may already contain anything before the JAPE execution; it doesn't matter. The important thing is that the new output annotations (DateMonthYearDash in your case) are created there by the time the JAPE transducer PR finishes, so after a successful JAPE execution you should see the new annotations in that set.

    Some terminology

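    Note that annotation sets have names, while annotations have a type, an id, offsets, features and the annotation set they belong to.
    So DateMonthYearDash is an annotation type, not an annotation set; its annotations end up in whichever set outputASName names.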
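
    To make the relationship between sets and annotations concrete, here is a minimal sketch in plain Java. It is a toy model, not the GATE API: the class AnnotationSetModel, its methods and the set name "MyDates" are made up for illustration, and runDateRule only pretends to be a JAPE transducer run. The empty string stands for GATE's default (unnamed) annotation set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of named annotation sets. This is NOT the GATE API. */
public class AnnotationSetModel {

    /** An annotation has an id, a type, offsets and features. */
    public static final class Annotation {
        public final int id;
        public final String type;
        public final long start;
        public final long end;
        public final Map<String, Object> features = new HashMap<>();

        Annotation(int id, String type, long start, long end) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Sets are identified by name; "" models the default (unnamed) set.
    private final Map<String, List<Annotation>> sets = new HashMap<>();
    private int nextId = 0;

    /** Returns the set with the given name, creating it if it does not exist yet. */
    public List<Annotation> getSet(String name) {
        return sets.computeIfAbsent(name, k -> new ArrayList<>());
    }

    /** Adds an annotation of the given type to the named set. */
    public Annotation add(String setName, String type, long start, long end) {
        Annotation a = new Annotation(nextId++, type, start, end);
        getSet(setName).add(a);
        return a;
    }

    /**
     * Hypothetical stand-in for one transducer run: reads from inputASName and
     * writes to outputASName. Returns the number of annotations it created.
     */
    public int runDateRule(String inputASName, String outputASName) {
        List<Annotation> input = getSet(inputASName);
        boolean hasToken = input.stream().anyMatch(a -> a.type.equals("Token"));
        boolean hasLookup = input.stream().anyMatch(a -> a.type.equals("Lookup"));
        if (!hasToken || !hasLookup) {
            return 0; // no input annotations to match against, so nothing is produced
        }
        // Offsets 0..18 cover the whole text "9 - January - 2017".
        Annotation date = add(outputASName, "DateMonthYearDash", 0, 18);
        date.features.put("rule", "ddmmyyyydash");
        return 1;
    }

    public static void main(String[] args) {
        AnnotationSetModel doc = new AnnotationSetModel();
        doc.add("", "Token", 0, 1);   // "9" in the default set
        doc.add("", "Lookup", 4, 11); // "January" in the default set

        System.out.println(doc.runDateRule("", "MyDates"));        // 1
        System.out.println(doc.getSet("MyDates").size());          // 1
        System.out.println(doc.runDateRule("NoSuchSet", "Other")); // 0
    }
}
```

    In the sketch, the second call produces nothing because the set it reads from holds no Token or Lookup annotations, which mirrors a transducer whose inputASName points at the wrong set.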


    JAPE correction

    I found some issues in your JAPE grammar:

    1. Don't include SpaceToken in Input unless you explicitly use it in your grammar or you are sure there will be none inside the pattern. See also: Concept of Space Token in JAPE
    2. ({Lookup.majorType=="Month"}) -> ({Lookup.minorType=="month"})
    3. (DAY_ONE|DAY_TWO) -> (DAY_ONE)

    Result after these corrections, running after the ANNIE pipeline on the document "9 - January - 2017": [Screenshot: GATE doc output]

    JAPE grammar after corrections:

        Phase: datesearching
        Input: Token Lookup
        Options: control = appelt

        Macro: DAY_ONE
        ({Token.kind == number,Token.category==CD, Token.length == "1"})

        Macro: YEAR
        ({Token.kind == number,Token.category==CD, Token.length == "4"})

        Macro: MONTH
        ({Lookup.minorType=="month"})

        Rule: ddmmyyydash
        (
            (DAY_ONE)
            ({Token.string == ","}|{Token.string == "."} |{Token.string == "-"})
            (MONTH)
            ({Token.string == ","}|{Token.string == "."} |{Token.string == "-"})
            (YEAR)
        )
        :ddmmyyyydash
        -->
            :ddmmyyyydash.DateMonthYearDash= {rule = "ddmmyyyydash"}
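
    Why removing SpaceToken from Input matters can be shown with a small plain-Java model. This is only a sketch of the matching idea, not GATE code: InputFilterDemo and ruleMatches are hypothetical names, the month is reduced to a single Lookup annotation (in GATE the same text would also be covered by a Token), and real JAPE matching is more involved. The model keeps only the annotation types listed in Input and then requires the rule elements to match consecutive annotations.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Toy model of how the JAPE Input line affects matching. This is NOT the GATE API. */
public class InputFilterDemo {

    /** Simplified annotation: just a type and the text it covers. */
    public static final class Ann {
        final String type;
        final String text;

        Ann(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // "9 - January - 2017" as a flat sequence of annotations (simplified).
    static final List<Ann> DOC = List.of(
        new Ann("Token", "9"), new Ann("SpaceToken", " "),
        new Ann("Token", "-"), new Ann("SpaceToken", " "),
        new Ann("Lookup", "January"), new Ann("SpaceToken", " "),
        new Ann("Token", "-"), new Ann("SpaceToken", " "),
        new Ann("Token", "2017"));

    /** A Token consisting of exactly `length` digits. */
    static Predicate<Ann> number(int length) {
        return a -> a.type.equals("Token") && a.text.matches("\\d{" + length + "}");
    }

    static final Predicate<Ann> SEPARATOR =
        a -> a.type.equals("Token") && Set.of(",", ".", "-").contains(a.text);
    static final Predicate<Ann> MONTH = a -> a.type.equals("Lookup");

    // Same shape as the corrected rule: DAY_ONE, separator, MONTH, separator, YEAR.
    static final List<Predicate<Ann>> RULE =
        List.of(number(1), SEPARATOR, MONTH, SEPARATOR, number(4));

    /** Keeps only annotations whose type is listed in Input, then looks for a contiguous match. */
    public static boolean ruleMatches(Set<String> inputTypes) {
        List<Ann> visible = DOC.stream()
            .filter(a -> inputTypes.contains(a.type))
            .collect(Collectors.toList());
        for (int start = 0; start + RULE.size() <= visible.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < RULE.size() && ok; i++) {
                ok = RULE.get(i).test(visible.get(start + i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(ruleMatches(Set.of("Token", "Lookup")));               // true
        System.out.println(ruleMatches(Set.of("Token", "Lookup", "SpaceToken"))); // false
    }
}
```

    With SpaceToken listed in Input, the spaces in "9 - January - 2017" sit between the elements the rule expects, so the rule no longer matches.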
    

    What to do when JAPE does not produce anything

    You have to investigate the input annotations and "debug" your JAPE grammar. Usually an expected input annotation is missing, or there is an extra annotation you did not expect to be there. GATE has a useful view for this purpose: the annotation stack. Features of input annotations can also have a different name or value than you expected (e.g. which is correct: {Lookup.majorType=="Month"} or {Lookup.minorType=="month"}?).

    By "debugging" a JAPE grammar I mean: simplify the rule until it starts working, testing it each time on a simple document where it should definitely match. In your case you can try it without the (DAY_ONE) part. If it still doesn't work, try only (MONTH)({Token.string == "-"})(YEAR), or even (MONTH) alone, and so on until you find the mistake in the grammar.
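
    The simplification strategy can be sketched in plain Java as well. This is a toy model rather than the GATE API: SimplifyRuleDemo, trySimplifications and month are hypothetical names, and the Lookup features majorType = "date" and minorType = "month" are assumed for illustration, so check the real values on your own document in GATE.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Toy model of debugging a rule by simplifying it. This is NOT the GATE API. */
public class SimplifyRuleDemo {

    /** Simplified annotation: type, covered text and string features. */
    public static final class Ann {
        final String type;
        final String text;
        final Map<String, String> features;

        Ann(String type, String text, Map<String, String> features) {
            this.type = type;
            this.text = text;
            this.features = features;
        }
    }

    // What "Input: Token Lookup" would see for "9 - January - 2017" (simplified;
    // the Lookup feature values are an assumption for this example).
    static final List<Ann> DOC = List.of(
        new Ann("Token", "9", Map.of()),
        new Ann("Token", "-", Map.of()),
        new Ann("Lookup", "January", Map.of("majorType", "date", "minorType", "month")),
        new Ann("Token", "-", Map.of()),
        new Ann("Token", "2017", Map.of()));

    static final Predicate<Ann> DAY = a -> a.type.equals("Token") && a.text.matches("\\d");
    static final Predicate<Ann> DASH = a -> a.type.equals("Token") && a.text.equals("-");
    static final Predicate<Ann> YEAR = a -> a.type.equals("Token") && a.text.matches("\\d{4}");

    /** Builds a MONTH test that checks one Lookup feature against an expected value. */
    public static Predicate<Ann> month(String feature, String value) {
        return a -> a.type.equals("Lookup") && value.equals(a.features.get(feature));
    }

    /** True if the rule elements match some run of consecutive annotations. */
    static boolean matches(List<Predicate<Ann>> rule) {
        for (int start = 0; start + rule.size() <= DOC.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < rule.size() && ok; i++) {
                ok = rule.get(i).test(DOC.get(start + i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    /** Tries the full rule, then ever simpler variants, and records which ones match. */
    public static Map<String, Boolean> trySimplifications(Predicate<Ann> month) {
        Map<String, List<Predicate<Ann>>> variants = new LinkedHashMap<>();
        variants.put("DAY - MONTH - YEAR", List.of(DAY, DASH, month, DASH, YEAR));
        variants.put("MONTH - YEAR", List.of(month, DASH, YEAR));
        variants.put("MONTH", List.of(month));

        Map<String, Boolean> results = new LinkedHashMap<>();
        variants.forEach((name, rule) -> results.put(name, matches(rule)));
        return results;
    }

    public static void main(String[] args) {
        // Original test: every variant fails, even MONTH alone.
        System.out.println(trySimplifications(month("majorType", "Month")));
        // Corrected test: every variant matches.
        System.out.println(trySimplifications(month("minorType", "month")));
    }
}
```

    If even the MONTH-only variant fails, the MONTH test itself is the suspect, which is how a majorType/minorType mix-up shows up.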