I am using Apache Tika 1.9, and content extraction is working very well.
The problem I am facing is with pages. I can extract the total page count from the document metadata, but I can't find any way to extract the content of each page separately.
I have searched a lot and tried several solutions suggested by other users, but none of them worked for me, possibly because I am on a newer Tika version.
Please suggest a solution or a direction for further research.
I will be thankful.
NOTE: I am using JRuby for the implementation.
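Here is the custom content handler class that I created, which solved my issue. It relies on the parser wrapping each page in a <div class="page"> element in the XHTML output (as Tika's PDF parser does) and collects the text of each such element into a hash keyed by page number.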
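# Assumes the Tika jars are already on the classpath.
require 'java'
java_import 'java.lang.StringBuilder'
java_import 'org.apache.tika.sax.ToXMLContentHandler'

# Collects the text of each page element into page_map (page number => text).
class PageContentHandler < ToXMLContentHandler
  attr_accessor :page_tag, :page_number, :page_class, :page_map

  def initialize
    super()                # invoke the no-argument Java constructor
    @page_number = 0       # 0 means no page has started yet
    @page_tag = 'div'
    @page_class = 'page'
    @page_map = {}
  end

  # A new page starts at every <div class="page">.
  def startElement(uri, local_name, q_name, atts)
    start_page if @page_tag == q_name && atts.getValue('class') == @page_class
  end

  def endElement(uri, local_name, q_name)
    end_page if @page_tag == q_name
  end

  # Append only the ch[start, length] slice; the buffer may hold other data.
  def characters(ch, start, length)
    return unless length > 0 && @page_number > 0
    builder = StringBuilder.new(length)
    builder.append(ch, start, length)
    @page_map[@page_number] << builder.to_s
  end

  def start_page
    @page_number += 1
    @page_map[@page_number] = String.new
  end

  # Hook for per-page cleanup; nothing to do for now.
  def end_page
  end
end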
To use this content handler:
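java_import 'org.apache.tika.parser.AutoDetectParser'
java_import 'org.apache.tika.parser.ParseContext'

# input_stream: a java.io.InputStream; @metadata_java: a Tika Metadata object.
parser = AutoDetectParser.new
handler = PageContentHandler.new
parser.parse(input_stream, handler, @metadata_java, ParseContext.new)
puts handler.page_map   # hash of page number => page text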
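
If you want to check the page-splitting logic without loading Tika, the same idea can be sketched in plain Ruby. This is only an illustration: PageTracker and the hard-coded event list are hypothetical stand-ins for the handler and for the SAX events a parser would send, and it assumes each page arrives as a div with class "page", as the handler above does.

```ruby
# Plain-Ruby stand-in for the page-tracking logic (no Tika required).
# PageTracker and the event list are hypothetical, for illustration only.
class PageTracker
  attr_reader :page_map

  def initialize(page_tag = 'div', page_class = 'page')
    @page_tag = page_tag
    @page_class = page_class
    @page_number = 0
    @page_map = {}
  end

  # Start a new page whenever the page element opens.
  def start_element(name, attrs)
    return unless name == @page_tag && attrs['class'] == @page_class
    @page_number += 1
    @page_map[@page_number] = String.new
  end

  # Text seen before the first page element is dropped.
  def characters(text)
    @page_map[@page_number] << text if @page_number > 0
  end
end

# Simulated SAX events for a two-page document.
events = [
  [:start, 'html', {}],
  [:text, 'ignored: outside any page'],
  [:start, 'div', { 'class' => 'page' }],
  [:start, 'p', {}],
  [:text, 'First page, '],
  [:text, 'second chunk.'],
  [:start, 'div', { 'class' => 'page' }],
  [:text, 'Second page.']
]

tracker = PageTracker.new
events.each do |kind, value, attrs|
  case kind
  when :start then tracker.start_element(value, attrs)
  when :text  then tracker.characters(value)
  end
end

tracker.page_map.each { |number, text| puts "Page #{number}: #{text}" }
```

Note that text can arrive in several chunks per page, which is why each chunk is appended to the page's string rather than assigned.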