itext ocr pdfbox text-extraction

How to set the FSM configuration for the Textricator PDF OCR reader?


I'm trying to use the PDF document parser called Textricator. It can parse a PDF with three different methods, using one of several common PDF text-extraction libraries (itext5, itext7, pdfbox). The available methods are text, table and form: text for plain raw text extraction, table for reading out structured table data, and form for parsing less structured forms using a finite state machine (FSM).

However, I am not able to use the form parser. Perhaps I simply don't understand how to organize the many configuration states. The documentation is lacking a simple form example, and someone recently posted an attempt to read a very basic table using the form method, but was not able to. I also gave it a shot, but without any success.

Q: Can someone help me configure the state machine in the YML file?
(The configuration is meant to parse the demo file from one of that repo's issues, which is shown in the screenshot below.)


[Screenshot of the demo file]


The YML configuration file:


extractor: "pdf.pdfbox"

header:
  default: 100
footer:
  default: 600

maxRowDistance: 2

rootRecordType: item
recordTypes:
  item:
    label: "item"
    valueTypes:
      - item
      - date
      - description
      - order_number
      - quantity
      - price

valueTypes:
  item:
    label: "Item"
  date:
    label: "Date"
  description:
    label: "Description"
  order_number:
    label: "OrderNo"
  quantity:
    label: "Qty"
  price:
    label: "Price"
 
initialState: "INIT"

states:
  INIT:
    transitions:
      -
        condition: item
        nextState: item

  item:
    startRecord: true
    transitions:
      -
        condition: date
        nextState: date  

  date:
    include: true
    transitions:
      -
        condition: description
        nextState: description  

  description:
    include: true
    transitions:
      -
        condition: description
        nextState: description     
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity

  order_number:
    include: true
    transitions:
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity

  quantity:
    include: true
    transitions:
      -
        condition: price
        nextState: price

  price:
    include: true
    transitions:
      -
        condition: end
        nextState: end

  end:
    include: false
    transitions:
      -
        condition: any
        nextState: end

conditions:

  item:         '73 < ulx < 110 and text =~ /(\\d)*/'
  date:         '110 < ulx < 181 and text =~ /([0-9\-]*)/'
  description:  '193 < ulx < 366'
#  order_number: '12 <= uly_rel <= 16 and text =~ ^.+/((\d{6})\-)((\d{2}))/'
  order_number: '12 <= uly_rel <= 16 and text =~ ^.+((\d{6})\-)((\d{2}))'
  quantity:     '393 < ulx < 459'
  price:        '459 < ulx < 523'

  end:          'text =~ /(Footer)/'
  any: "1 = 1"

You may wonder why I am insisting on using the form processor for this simple example. It is because my real-life document will have a much more complex sub-structure of child items under the Description field. This can only (?) be processed efficiently by a state machine, AFAIK.

But maybe this is not the right tool for the job. If so, what other options are there?


UPDATE: (2021-05-18)

The author of Textricator has now bumped the libraries used, updated the documentation, and corrected several working examples and user issues. Thanks to user mweber, I now have a perfectly working parser and no longer need to use awk to handle weird columns.


Solution

  • Textricator is kind of a hidden gem for PDF parsing, imo, so I'm happy to see someone using it. I posted a config that works with the sample document to the GitHub issue:

    extractor: "pdf.pdfbox"
    
    header:
      default: 100
    footer:
      default: 600
    
    maxRowDistance: 2
    
    rootRecordType: item
    recordTypes:
      item:
        label: "item"
        valueTypes:
          - item
          - date
          - description
          - order_number
          - quantity
          - price
    
    valueTypes:
      item:
        label: "Item"
      date:
        label: "Date"
      description:
        label: "Description"
      order_number:
        label: "OrderNo"
      quantity:
        label: "Qty"
      price:
        label: "Price"
    
    initialState: "INIT"
    
    states:
      INIT:
        include: false
        transitions:
          -
            condition: item
            nextState: item
          - condition: any
            nextState: INIT
    
      item:
        startRecord: true
        transitions:
          -
            condition: date
            nextState: date  
    
      date:
        include: true
        transitions:
          -
            condition: description
            nextState: description  
    
      description:
        include: true
        transitions:
          -
            condition: description
            nextState: description     
          -
            condition: order_number
            nextState: order_number
          -
            condition: quantity
            nextState: quantity
          -
            condition: item
            nextState: item
    
      order_number:
        include: true
        transitions:
          -
            condition: order_number
            nextState: order_number
          -
            condition: quantity
            nextState: quantity
    
      quantity:
        include: true
        transitions:
          - 
            condition: price
            nextState: price
    
      price:
        include: true
        transitions:
          -
            condition: end
            nextState: end
          - 
            condition: description
            nextState: description
          -
            condition: item
            nextState: item
    
      end:
        include: false
        transitions:
          -
            condition: any
            nextState: end
    
    conditions:
    
      item:         '73 < ulx < 110 and text =~ /(\\d)*/'
      date:         '110 < ulx < 181 and text =~ /([0-9\\-]*)/'
      description:  '193 < ulx < 366'
      order_number: '12 <= uly_rel <= 16 and text =~ /^.+(([0-9]{6})\\-)(([0-9]{2}))/'
      quantity:     '393 < ulx < 459'
      price:        '459 < ulx < 523'
    
      end:          'text =~ /(Footer)/'
      any: "1 = 1"
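
Compared with the config in the question, this one adds a catch-all `any` transition that keeps the machine in INIT, adds transitions back to item from description and price (and from price to description), and wraps the order_number regex in slashes. The effect of the extra transitions can be illustrated with a toy model in Python. This is not Textricator code: the `walk` function, the token labels, and the rule that an unmatched token raises an error are assumptions made for illustration, and each token is reduced to the name of the single condition it satisfies instead of being tested on position and text.

```python
# Toy model of the two state tables (hypothetical, not Textricator's code).
# Each state maps to an ordered list of (condition, next_state) pairs.

ORIGINAL = {
    "INIT": [("item", "item")],
    "item": [("date", "date")],
    "date": [("description", "description")],
    "description": [("description", "description"),
                    ("order_number", "order_number"),
                    ("quantity", "quantity")],
    "order_number": [("order_number", "order_number"),
                     ("quantity", "quantity")],
    "quantity": [("price", "price")],
    "price": [("end", "end")],
    "end": [("any", "end")],
}

# Same table with the transitions that the working config adds.
FIXED = {name: list(pairs) for name, pairs in ORIGINAL.items()}
FIXED["INIT"].append(("any", "INIT"))        # skip text before the first item
FIXED["description"].append(("item", "item"))
FIXED["price"] += [("description", "description"), ("item", "item")]

# Hypothetical token stream: a title line, two records, then the footer.
TOKENS = ["title",
          "item", "date", "description", "quantity", "price",
          "item", "date", "description", "order_number", "quantity", "price",
          "end"]


def walk(transitions, tokens, state="INIT"):
    """Return the states visited; raise ValueError if no transition matches.

    Assumption: transitions are tried in order and the first match wins.
    """
    visited = []
    for token in tokens:
        for condition, next_state in transitions[state]:
            if condition == "any" or condition == token:
                state = next_state
                break
        else:
            raise ValueError(f"no transition from {state!r} for token {token!r}")
        visited.append(state)
    return visited


if __name__ == "__main__":
    for name, table in (("original", ORIGINAL), ("fixed", FIXED)):
        try:
            print(name, "->", walk(table, TOKENS))
        except ValueError as err:
            print(name, "-> stuck:", err)
```

In this model the original table has no way out of INIT for anything that is not an item, and no way from price to the next item, so only the first record could ever be read. What the real tool does when no transition matches is not modeled here.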