I am trying to extract text from a large PDF file (a text PDF, not a scanned/rasterized one) using Apache Tika.
However, when I compare the extracted text with the original PDF, a lot of the text content is missing. I have tried setMaxStringLength(-1) and BodyContentHandler(-1) to remove the output length limits, but I still cannot extract the full text content from the PDF file.
Below are the two samples I have tried.
Sample: 1
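import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class Extract
{
    public static void main(String[] args) throws IOException, TikaException
    {
        File file = new File("1.pdf");
        // Instantiate the Tika facade class
        Tika tika = new Tika();
        // -1 removes the limit on the length of the returned string
        tika.setMaxStringLength(-1);
        String filecontent = tika.parseToString(file);
        System.out.println("Extracted Content: " + filecontent);
    }
}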
Sample: 2
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class Extract
{
    public static void main(String[] args) throws IOException, SAXException, TikaException
    {
        // -1 disables the write limit, allowing an unlimited number of characters
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        // Parse the document using the PDF parser; the stream is closed automatically
        try (InputStream inputstream = new FileInputStream(new File("1.pdf"))) {
            PDFParser pdfparser = new PDFParser();
            pdfparser.parse(inputstream, handler, metadata, pcontext);
        }

        // Print the content of the document
        System.out.println("Contents of the PDF :" + handler.toString());

        // Print the metadata of the document
        System.out.println("Metadata of the PDF:");
        for (String name : metadata.names()) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }
}
I can see content from the last page of the PDF in the output, so parsing reaches the end of the document. However, a lot of text is missing at seemingly random places.
This was a stupid mistake on my part. I was copying the output from the Eclipse console, which has a limited buffer, so part of the printed text had already been discarded. When I wrote the output to a file instead, the extracted text was complete.
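For anyone hitting the same problem, here is a minimal sketch of writing the text to a file rather than printing it. It uses only the standard library so it runs on its own: the generated string is a stand-in for the result of tika.parseToString(file), and the class name ExtractToFile, the helper writeText, and the temporary output file are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExtractToFile {

    // Hypothetical helper: writes the whole text to a UTF-8 file and
    // returns the resulting file size in bytes.
    static long writeText(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        return Files.size(target);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the extracted content. In the real program this
        // would be: String text = tika.parseToString(new File("1.pdf"));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        String text = sb.toString();

        // Write to a file instead of System.out, so no console buffer
        // limit can truncate what you inspect afterwards.
        Path target = Files.createTempFile("extracted", ".txt");
        long bytes = writeText(text, target);
        System.out.println("Wrote " + bytes + " bytes to " + target);
    }
}
```

Comparing the length of the extracted string with the size of the written file is a quick way to confirm that nothing was lost between extraction and inspection.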