Tags: java, pdf, web-crawler, doc

How to extract text from PDF and DOC files without downloading them


I searched a lot before asking this question. I have a Java program that crawls web pages looking for .doc and .pdf files. It can download them, but a single .pdf or .doc can take up to 3-4 MB, which is a problem because there are millions of files. So I decided to extract their text without downloading the whole file. Basically, I need to view a PDF or DOC file online and download only its text, but I could not figure out how to do that. If necessary, I can provide my code.

Edit: This question can be closed now, since I got the idea and the (non-)solution. Thanks for the help.

And what's up with those downvotes on the question?


Solution

  • That is not possible. Text extraction can only start once the document's bytes have been downloaded.

    (The exception is if you also control the server: then you could do the extraction server-side and offer a .txt download link.)
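Downloading the bytes does not have to mean saving a file, though. As a rough sketch (not part of the original answer), the crawler could read each document into memory, enforce a size cap, and pass the bytes straight to a text extractor. The class name `InMemoryFetch`, the methods `readAtMost` and `fetch`, and the `maxBytes` limit are all hypothetical names chosen for illustration. The extraction step itself is left out, since it depends on which parser library you use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class InMemoryFetch {

    // Reads the whole stream into memory and fails if it exceeds maxBytes.
    // maxBytes is a hypothetical safety cap, not something from the question.
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (buffer.size() + n > maxBytes) {
                throw new IOException("Document larger than " + maxBytes + " bytes");
            }
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Transfers the bytes at the given URL into memory; nothing is written to disk.
    // The bytes still cross the network, which is unavoidable.
    static byte[] fetch(String url, int maxBytes) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return readAtMost(in, maxBytes);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a network stream so the demo runs offline.
        byte[] fake = "%PDF-1.4 fake".getBytes(StandardCharsets.US_ASCII);
        byte[] held = readAtMost(new ByteArrayInputStream(fake), 1024);
        System.out.println(held.length + " bytes held in memory");
        // At this point the bytes would go to a PDF or DOC parser of your choice.
    }
}
```

This saves disk space only; the bandwidth cost is the same as a normal download, which is the point the answer makes.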