Tags: java, pdf, apache-tika, text-extraction

Apache Tika could not extract full text content from a large pdf


I am trying to extract text from a large PDF file (not a scanned/rasterized PDF) using Apache Tika.

But when I compare the extracted text with the original text in the PDF, I find that a lot of the content is missing. I have tried setMaxStringLength(-1) and BodyContentHandler(-1) to remove the output limits, but I still cannot extract the full text content of the PDF file.

Below are the two samples I have tried.

Sample: 1

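import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class Extract
{
    public static void main( String[] args ) throws IOException, TikaException
    {
        File file = new File("1.pdf");

        // Instantiate the Tika facade class
        Tika tika = new Tika();
        // -1 removes the limit on the length of the extracted string
        tika.setMaxStringLength(-1);
        String filecontent = tika.parseToString(file);
        System.out.println("Extracted Content: " + filecontent);
    }
}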

Sample: 2

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class Extract
{
    public static void main( String[] args ) throws IOException, SAXException, TikaException
    {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the character limit
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        // Parse the document using the PDF parser; the stream is closed automatically
        try (FileInputStream inputstream = new FileInputStream(new File("1.pdf"))) {
            PDFParser pdfparser = new PDFParser();
            pdfparser.parse(inputstream, handler, metadata, pcontext);
        }

        // Print the content of the document
        System.out.println("Contents of the PDF :" + handler.toString());

        // Print the metadata of the document
        System.out.println("Metadata of the PDF:");
        String[] metadataNames = metadata.names();

        for (String name : metadataNames) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }
}

I can see content from the last page of the PDF, but a lot of text is missing at random places in the output.


Solution

  • This was a silly mistake on my side. I was copying the output from the Eclipse console, which has a limited buffer, so part of the text was dropped there. When I wrote the output to a file instead, the extracted text was complete.
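
For anyone hitting the same thing, here is a minimal sketch of writing the extracted text to a file instead of printing it. The class name ExtractedTextWriter and the method writeToFile are hypothetical helpers made up for illustration and are not part of Tika. The sketch uses only the JDK and assumes the text has already been extracted into a String.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika: saves extracted text to disk
// so that a console buffer limit cannot truncate it.
final class ExtractedTextWriter {

    // Writes the whole string as UTF-8, replacing any existing file,
    // and returns the number of characters written.
    static int writeToFile(String content, Path target) throws IOException {
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
        return content.length();
    }
}
```

In Sample 1 this would replace the println call with something like ExtractedTextWriter.writeToFile(filecontent, Paths.get("1.txt")), where 1.txt is only an example output name. The resulting file can then be compared against the PDF without going through the console.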