java, android, pdfbox

PDFBox in Android or other means to extract text from PDF on device?


My app needs to process input from PDF files that consist mostly of text. I could do the parsing on my server, but I'd prefer not to. After exploring my options for text extraction, I found the PDFBox library and its Android port (https://github.com/TomRoush/PdfBox-Android).

In the app I show my users the standard UI for selecting the source document through ACTION_OPEN_DOCUMENT. Then I override onActivityResult to get the Uri - you know, the usual stuff.

The problem is that I can't figure out how to feed it to PDFBox, since we're not talking about "files" but rather "documents", and the lib wants a real file path. If I give it a real path for a particular file, the text parsing goes okay, but that's certainly not best practice and it can't be done for every document out there (cloud storage etc.), so instead I do this:

InputStream inputStream = getContentResolver().openInputStream(uri);

and then read it line by line so in the end I can have it all in one string. Obviously, it works okay.

But how do I actually feed this data into PDFBox so it can do its text extraction magic? I can't find any docs on how to do it when I don't have the "real file path".

Maybe there are better ways now? This library is quite old. Basically, I need to extract text from a PDF on the Android device itself, not through an API call. I'm really stuck here.


Solution

  • I needed similar functionality for my app, so I tried the solution suggested by Mike M. in the comments under your question and it worked great for me (so this is really his answer; I just confirmed that it works and supplied the code). Hope it helps.

    The “magic” is actually in these two lines:

    InputStream inputStream = this.getContentResolver().openInputStream(fileUri);
    document = PDDocument.load(inputStream);
    

    But for some context (and for those who search for an answer to this problem on another occasion), here is the whole example code:

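    // Imports assume AndroidX; older projects use the support-library equivalents.
    import android.content.Intent;
    import android.net.Uri;
    import android.os.Bundle;
    import android.view.View;
    import android.widget.Button;
    import android.widget.TextView;

    import androidx.annotation.Nullable;
    import androidx.appcompat.app.AppCompatActivity;

    // PdfBox-Android classes; package paths can differ between library versions.
    import com.tom_roush.pdfbox.android.PDFBoxResourceLoader;
    import com.tom_roush.pdfbox.pdmodel.PDDocument;
    import com.tom_roush.pdfbox.text.PDFTextStripper;

    import java.io.IOException;
    import java.io.InputStream;

    public class MainActivity extends AppCompatActivity {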
    
        private static final int OPEN_FILE_REQUEST_CODE = 1;
        Intent intentOpenfile;
        Uri fileUri;
    
        TextView tvTextDisplay;
        Button bOpenFile;
    
        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            setContentView(R.layout.activity_main);
    
            tvTextDisplay = findViewById(R.id.tv_text_display);
    
            PDFBoxResourceLoader.init(getApplicationContext());
    
            bOpenFile = findViewById(R.id.b_open_file);
            bOpenFile.setOnClickListener(new View.OnClickListener() {
                @Override
                public void onClick(View v) {
                    intentOpenfile = new Intent(Intent.ACTION_OPEN_DOCUMENT);
                    intentOpenfile.setType("application/pdf");
                    startActivityForResult(intentOpenfile, OPEN_FILE_REQUEST_CODE);
                }
            });
        }
    
        @Override
        protected void onActivityResult(int requestCode, int resultCode, @Nullable Intent data) {
            super.onActivityResult(requestCode, resultCode, data);
            // Ignore other requests, cancelled pickers, and results without a Uri.
            if (requestCode != OPEN_FILE_REQUEST_CODE || resultCode != RESULT_OK
                    || data == null || data.getData() == null) {
                return;
            }
            fileUri = data.getData();
            String parsedText = null;
            // try-with-resources closes the document and the stream even if parsing fails.
            // Note: this runs on the main thread; large PDFs are better parsed in the background.
            try (InputStream inputStream = getContentResolver().openInputStream(fileUri);
                 PDDocument document = PDDocument.load(inputStream)) {
                PDFTextStripper pdfStripper = new PDFTextStripper();
                // Page numbers are 1-based; this extracts the first page only.
                pdfStripper.setStartPage(1);
                pdfStripper.setEndPage(1);
                parsedText = "Parsed text: " + pdfStripper.getText(document);
            } catch (IOException e) {
                e.printStackTrace();
            }
            tvTextDisplay.setText(parsedText);
        }
    }
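
If some other API really does insist on a file path, one possible fallback is to copy the document's stream into a private temporary file first and hand that file over. Below is a minimal sketch in plain Java. The class name StreamToFile and the method copyToTempFile are hypothetical helpers made up for illustration; they are not part of PDFBox or the Android SDK.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper for illustration; not part of PDFBox or the Android SDK. */
final class StreamToFile {

    private StreamToFile() {}

    /**
     * Copies everything readable from {@code in} into a new temporary file
     * inside {@code dir} (or the default temp directory if {@code dir} is null)
     * and returns that file. The caller is responsible for closing {@code in}
     * and for deleting the returned file when it is no longer needed.
     */
    static File copyToTempFile(InputStream in, File dir) throws IOException {
        File target = File.createTempFile("picked-", ".pdf", dir);
        try (OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int read;
            // Copy raw bytes; a PDF is binary, so it must not be read as text lines.
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            // Do not leave a partially written file behind.
            target.delete();
            throw e;
        }
        return target;
    }
}
```

On Android this would presumably be called with getContentResolver().openInputStream(fileUri) as the stream and getCacheDir() as the directory. The copy costs extra storage and time, so loading straight from the InputStream, as in the answer above, stays the simpler route whenever the library accepts a stream.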