Apache Tika Document Content Extraction Per Page


I am using Apache Tika 1.9, and content extraction is working well.

The problem I am facing is with pages. I can get the total page count from the document metadata, but I can't find any way to extract the content of each page separately.

I have searched a lot and tried some solutions suggested by other users, but none of them worked for me, maybe because I am on a newer Tika version.

Please suggest any solution or further research direction for this.

I will be thankful.

NOTE: I am using JRuby for the implementation.


Solution

  • Here is the custom content handler class I created, which solved my issue.

    require 'java'
    java_import 'org.apache.tika.sax.ToXMLContentHandler'
    java_import 'java.lang.StringBuilder'

    # Collects the text of each page, keyed by page number (starting at 1).
    class PageContentHandler < ToXMLContentHandler
      attr_accessor :page_tag, :page_number, :page_class, :page_map

      def initialize
        super() # run the Java superclass constructor
        @page_number = 0
        @page_tag = 'div'
        @page_class = 'page'
        @page_map = {}
      end

      # A new page starts at each <div class="page"> element.
      def startElement(uri, local_name, q_name, atts)
        start_page if @page_tag == q_name && atts.getValue('class') == @page_class
      end

      def endElement(uri, local_name, q_name)
        end_page if @page_tag == q_name
      end

      # Append only the slice ch[start, length]; the rest of the buffer is not ours.
      def characters(ch, start, length)
        return unless length > 0 && @page_number > 0
        builder = StringBuilder.new(length)
        builder.append(ch, start, length)
        @page_map[@page_number] << builder.to_s
      end

      def start_page
        @page_number += 1
        @page_map[@page_number] = String.new
      end

      # Hook for per-page cleanup; intentionally does nothing.
      def end_page
      end
    end
    

    Here is how to use this content handler:

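    java_import 'org.apache.tika.parser.AutoDetectParser'
    java_import 'org.apache.tika.parser.ParseContext'
    # input_stream is a java.io.InputStream; @metadata_java is an org.apache.tika.metadata.Metadata
    parser = AutoDetectParser.new
    handler = PageContentHandler.new
    parser.parse(input_stream, handler, @metadata_java, ParseContext.new)
    puts handler.page_map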
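
    To see how page_map gets filled without running Tika, here is a minimal sketch in plain Ruby. It is not the handler itself: PageCollector, start_element, and the hand-written event list are hypothetical stand-ins for the SAX callbacks, and it assumes (as the handler does) that the parser marks each page with a div whose class is "page". Formats that do not emit such a marker would end up with an empty map.

```ruby
# Plain-Ruby stand-in for the handler's bookkeeping; PageCollector is a
# hypothetical name and no Tika or Java classes are involved.
class PageCollector
  attr_reader :page_map

  def initialize(page_tag = 'div', page_class = 'page')
    @page_tag = page_tag
    @page_class = page_class
    @page_number = 0
    @page_map = {}
  end

  # attrs is a plain Hash here; the real SAX callback receives an Attributes object.
  def start_element(name, attrs = {})
    return unless name == @page_tag && attrs['class'] == @page_class
    @page_number += 1
    @page_map[@page_number] = String.new
  end

  # Text that arrives before the first page marker is dropped.
  def characters(text)
    @page_map[@page_number] << text if @page_number > 0
  end
end

# Hand-written event sequence imitating a two-page document.
events = [
  [:characters, 'title text outside any page'],
  [:start_element, 'div', { 'class' => 'page' }],
  [:start_element, 'p'],
  [:characters, 'First page. '],
  [:characters, 'More text.'],
  [:start_element, 'div', { 'class' => 'page' }],
  [:characters, 'Second page.']
]

collector = PageCollector.new
events.each { |method_name, *args| collector.public_send(method_name, *args) }

# Print each page with its number, in the order the pages were seen.
collector.page_map.each do |number, text|
  puts "Page #{number}: #{text}"
end
```

    Because a new entry is opened only when a page marker is seen, several characters calls in a row simply append to the current page, which is why the handler concatenates instead of overwriting.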