Apache Tika Document Content Extraction Per Page

I am using Apache Tika 1.9, and content extraction is working well.

The problem I am facing is with pages. I can extract the total page count from the document metadata, but I can't find any way to extract the content of each page separately.

I have searched a lot and tried some solutions suggested by other users, but none of them worked for me, maybe because I am on the latest Tika version.

Please suggest a solution or a direction for further research.

I will be thankful.

NOTE: I am using JRuby for the implementation.

Solution

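Here is the custom content handler class that I created, which solved my issue. It starts a new page whenever it sees a <div class="page"> element in Tika's XHTML output and collects the text that follows into a hash keyed by page number.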

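require 'java'

# Assumes the Tika jars are already on the classpath.
java_import 'org.apache.tika.sax.ToXMLContentHandler'
java_import 'java.lang.StringBuilder'

# Collects the text of each page into page_map, keyed by 1-based page number.
class PageContentHandler < ToXMLContentHandler
  attr_accessor :page_tag, :page_number, :page_class, :page_map

  def initialize
    super() # invoke the Java no-arg constructor
    @page_number = 0
    @page_tag = 'div'
    @page_class = 'page'
    @page_map = {}
  end

  # A new page starts at every <div class="page">.
  def startElement(uri, local_name, q_name, atts)
    start_page if @page_tag == q_name && atts.getValue('class') == @page_class
  end

  def endElement(uri, local_name, q_name)
    end_page if @page_tag == q_name
  end

  # Append only the slice ch[start, length]; a SAX parser may reuse the buffer.
  def characters(ch, start, length)
    return unless length > 0 && @page_number > 0
    builder = StringBuilder.new(length)
    builder.append(ch, start, length)
    @page_map[@page_number] << builder.to_s
  end

  def start_page
    @page_number += 1
    @page_map[@page_number] = String.new
  end

  # Hook for per-page cleanup; intentionally a no-op.
  def end_page
  end
end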
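
To see how the callbacks interact without a Tika install, here is a rough plain-Ruby sketch of the same bookkeeping. PageCollector, EVENTS and collect_pages are hypothetical names made up for this illustration, and the event list is hand-written rather than produced by a real parser, so treat it as a model of the idea and not as Tika's actual output.

```ruby
# Plain-Ruby stand-in for the handler logic (no Tika or JRuby required).
# PageCollector, EVENTS and collect_pages are hypothetical, for illustration only.
class PageCollector
  attr_reader :page_map

  def initialize(page_tag = 'div', page_class = 'page')
    @page_tag = page_tag
    @page_class = page_class
    @page_number = 0
    @page_map = {}
  end

  # attrs is a plain Hash here; the real SAX callback receives an Attributes object.
  def start_element(q_name, attrs = {})
    return unless q_name == @page_tag && attrs['class'] == @page_class
    @page_number += 1
    @page_map[@page_number] = String.new
  end

  # Text seen before the first page marker is ignored.
  def characters(text)
    @page_map[@page_number] << text if @page_number > 0
  end
end

# Hand-written events imitating a two-page document.
EVENTS = [
  [:start, 'body', {}],
  [:text, 'ignored preamble'],
  [:start, 'div', { 'class' => 'page' }],
  [:start, 'p', {}],
  [:text, 'First page text.'],
  [:start, 'div', { 'class' => 'page' }],
  [:text, 'Second page, '],
  [:text, 'delivered in two chunks.']
].freeze

# Replays the events through a fresh collector and returns its page map.
def collect_pages(events)
  collector = PageCollector.new
  events.each do |kind, value, attrs|
    case kind
    when :start then collector.start_element(value, attrs)
    when :text  then collector.characters(value)
    end
  end
  collector.page_map
end
```

Calling collect_pages(EVENTS) gives a hash with keys 1 and 2. The text that arrives in two chunks ends up joined under page 2, which is why the handler appends to the current page's string instead of overwriting it.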

To use PageContentHandler with Tika, here is the code:

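java_import 'org.apache.tika.parser.AutoDetectParser'
java_import 'org.apache.tika.parser.ParseContext'

# input_stream: java.io.InputStream; @metadata_java: org.apache.tika.metadata.Metadata
parser = AutoDetectParser.new
handler = PageContentHandler.new
parser.parse(input_stream, handler, @metadata_java, ParseContext.new)
puts handler.page_map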
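
The per-page strings usually need some tidying before they are useful. Below is a small sketch of one way to do that in plain Ruby. SAMPLE_PAGE_MAP is made-up data standing in for handler.page_map, and clean_pages is a hypothetical helper, not part of Tika.

```ruby
# Made-up sample of what handler.page_map might hold after parsing.
SAMPLE_PAGE_MAP = {
  1 => "  Introduction\n\nFirst page body.  ",
  2 => "\n",
  3 => "Closing   remarks\non the last page."
}.freeze

# Collapse whitespace runs and drop pages that contain no visible text.
def clean_pages(page_map)
  page_map.each_with_object({}) do |(number, text), cleaned|
    normalized = text.gsub(/\s+/, ' ').strip
    cleaned[number] = normalized unless normalized.empty?
  end
end

clean_pages(SAMPLE_PAGE_MAP).each do |number, text|
  puts "Page #{number}: #{text}"
end
```

Page numbers are kept as they were, so a blank page simply leaves a gap in the keys instead of shifting the pages after it.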