file-upload clojure

Clojure - Best way to extract PDF/Doc files into simple text


I'm looking for a simple way to parse every file that is uploaded to my app and convert it into plain text. My web app runs on Clojure, and I would prefer a single API that can parse all kinds of file types.


Solution

  • Take a look at Apache POI, Apache PDFBox, and Apache Tika.

    They are Java libraries for working with various file formats. You can call their Java APIs directly from your Clojure app through Java interop.

    Here is a quote from the Apache Tika website:

    The Apache Tika™ toolkit detects and extracts metadata and text content from various documents - from PPT to CSV to PDF - using existing parser libraries. Tika unifies these parsers under a single interface to allow you to easily parse over a thousand different file types. Tika is useful for search engine indexing, content analysis, translation, and much more.

    Here is a quote from the Apache PDFBox website:

    The Apache PDFBox™ library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents

    And here is a quote from the Apache POI website:

    For a number of years now, Apache POI has provided basic text extraction for all the project supported file formats. In addition, as well as the (plain) text, these provides access to the metadata associated with a given file, such as title and author.
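
Since the question asks for one API that covers many file types, Tika is probably the closest fit of the three. Below is a minimal sketch of calling it from Clojure through Java interop. It assumes that `tika-core` and a Tika parser bundle (for example `tika-parsers-standard-package`) are on the classpath; the namespace `example.extract` and the function name `extract-text` are made up for illustration.

```clojure
(ns example.extract
  (:require [clojure.java.io :as io])
  (:import (java.io File)
           (org.apache.tika Tika)))

;; One shared instance of the Tika facade class.
;; Assumption: the parser modules are on the classpath. With only tika-core
;; present, Tika has no format parsers and extraction returns little or nothing.
(def ^Tika tika (Tika.))

(defn extract-text
  "Returns the plain-text content of the file at `path` (a string path or a
  java.io.File). Tika detects the file type itself and picks a parser.
  Throws java.io.IOException or org.apache.tika.exception.TikaException
  if the file cannot be read or parsed."
  [path]
  (let [^File f (io/file path)]
    ;; Tika.parseToString(File) detects the type and returns the text.
    (.parseToString tika f)))
```

A call such as `(extract-text "uploads/report.pdf")` (the path is hypothetical) would then return the document text as a string. Note that `parseToString` truncates its output at a default maximum length, which can be changed with `setMaxStringLength` on the `Tika` instance. If you only ever need PDF or Office files, using PDFBox or POI directly is also an option, at the cost of writing one code path per format.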