gate

How do I get the name of the document the pipeline is currently working on?


Let's say a corpus has 1k docs and is processed by a pipeline.
At some point, the pipeline gets stuck, throws an exception, or behaves strangely. All of these problems are very likely tied to a specific document.
So it would be nice to know which document is being processed in the pipeline, for example by printing the doc name from a JAPE transducer.


Solution

  • To print the name of the document being processed, you can write a simple JAPE rule like:

    Phase:  DocName
    Input: Token
    Options: control = once
    
    Rule:DocName
    (
     {Token}
    )
    -->
    {
      System.out.println(doc.getName());
    }
    

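    Put this rule first in your pipeline, right after whatever creates the Token annotations (it matches on Token, so it cannot run before the tokeniser). Note that it only fires if the document contains at least one Token.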
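
    If the name alone is not enough to find the file, a slightly more verbose variant might help. This is only a sketch: the phase and rule name DocNameVerbose is made up, and writing to System.err rather than System.out is an assumption that the message is easier to spot next to a stack trace. doc.getSourceUrl() can be null for documents that were not loaded from a URL, in which case "null" is printed.

    Phase:  DocNameVerbose
    Input: Token
    Options: control = once

    Rule:DocNameVerbose
    (
     {Token}
    )
    -->
    {
      // Print the document name and where it was loaded from.
      System.err.println("Processing: " + doc.getName()
          + " (" + doc.getSourceUrl() + ")");
    }

    The last line printed before the pipeline hangs or throws is then the document to look at.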