python-3.x pdf epub

Python libraries and ebook/pdf files management


I have tons of books in digital format, mostly PDF but many in EPUB as well. There are so many that it is difficult to organize them in folders: a book may belong in two folders, so it is saved in one folder and the other folders just hold a link to the file. I searched for ebook reader software that can identify a book and assign it to a set on its own, but I did not find much. So I decided to write a small Python program that does that and then opens the default reader for the file. For this reason I am looking for a Python library that can read PDF files and another for EPUB files: a pair of libraries that can read the tags/metadata inside a file, so the program can decide which folder is the right place to save it.


Solution

  • The range of available Python PDF tools, modules, and libraries is a bit confusing, and it takes a moment to figure out what is what and which projects are actively maintained. Based on our research, these are the candidates that are up to date:

    PyPDF2: A Python library to extract document information and content, split documents page-by-page, merge documents, crop pages, and add watermarks. PyPDF2 supports both unencrypted and encrypted documents.

    PDFMiner: Written entirely in Python, and works well for Python 2.4. For Python 3, use the fork pdfminer.six. Both packages allow you to parse, analyze, and convert PDF documents. This includes support for PDF 1.7 as well as CJK languages (Chinese, Japanese, and Korean) and various font types (Type1, TrueType, Type3, and CID).

    PDFQuery: It describes itself as "a fast and friendly PDF scraping library" which is implemented as a wrapper around PDFMiner, lxml, and pyquery. Its design aim is "to reliably extract data from sets of PDFs with as little code as possible."

    tabula-py: It is a simple Python wrapper of tabula-java, which can read tables from PDFs and convert them into Pandas DataFrames. It also enables you to convert a PDF file into a CSV/TSV/JSON file.

    pdflib for Python: An extension of the Poppler library that offers Python bindings for it. It allows you to parse, analyze, and convert PDF documents. Not to be confused with its commercial counterpart of the same name.

    PyFPDF: A library for PDF document generation under Python. Ported from the FPDF PHP library, a well-known PDFlib-extension replacement with many examples, scripts, and derivatives.

    PDFTables: A commercial service that extracts tables from PDF documents. It offers an API so that PDFTables can be used as SaaS.

    PyX - the Python graphics package: PyX is a Python package for the creation of PostScript, PDF, and SVG files. It combines an abstraction of the PostScript drawing model with a TeX/LaTeX interface. Complex tasks like creating 2D and 3D plots in publication-ready quality are built out of these primitives.

    ReportLab: An ambitious, industrial-strength library largely focused on precise creation of PDF documents. Available freely as an Open Source version as well as a commercial, enhanced version named ReportLab PLUS.

    PyMuPDF (aka "fitz"): Python bindings for MuPDF, which is a lightweight PDF and XPS viewer. The library can access files in PDF, XPS, OpenXPS, epub, comic and fiction book formats, and it is known for its top performance and high rendering quality.

    pdfrw: A pure Python-based PDF parser to read and write PDF. It faithfully reproduces vector formats without rasterization. In conjunction with ReportLab, it helps to re-use portions of existing PDFs in new PDFs created with ReportLab.
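
For the use case in the question (read the metadata, then decide on a folder), a minimal sketch could look like the following. It assumes PyPDF2 2.x or later for the PDF side (`PdfReader(...).metadata`) and reads EPUB metadata with the standard library only, since an EPUB is a ZIP archive whose package (OPF) file carries Dublin Core fields. The function names, the `rules` mapping, and the "Unsorted" default are hypothetical and only there for illustration; many files have empty or unreliable metadata, so treat the result as a suggestion rather than a guarantee.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# XML namespaces used by the EPUB container file and the OPF package file.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_epub_metadata(path):
    """Return title, author and subjects from an EPUB's OPF package file."""
    with zipfile.ZipFile(path) as archive:
        # META-INF/container.xml points at the OPF package document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(archive.read(opf_path))
    meta = package.find("opf:metadata", NS)
    return {
        "title": meta.findtext("dc:title", default="", namespaces=NS),
        "author": meta.findtext("dc:creator", default="", namespaces=NS),
        "subjects": [s.text for s in meta.findall("dc:subject", NS) if s.text],
    }


def read_pdf_metadata(path):
    """Return title, author and keywords from a PDF's document information."""
    # Third-party dependency, imported lazily so EPUB-only use needs no install.
    from PyPDF2 import PdfReader

    info = PdfReader(str(path)).metadata  # None if the PDF has no info dictionary
    if info is None:
        return {"title": "", "author": "", "subjects": []}
    keywords = str(info.get("/Keywords", "") or "")
    return {
        "title": info.title or "",
        "author": info.author or "",
        # Keywords are commonly comma- or semicolon-separated (an assumption).
        "subjects": [k.strip() for k in keywords.replace(";", ",").split(",") if k.strip()],
    }


def read_metadata(path):
    """Dispatch on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".epub":
        return read_epub_metadata(path)
    if suffix == ".pdf":
        return read_pdf_metadata(path)
    raise ValueError(f"unsupported file type: {suffix}")


def choose_folder(metadata, rules, default="Unsorted"):
    """Return the folder of the first rule keyword found in title or subjects."""
    haystack = " ".join([metadata["title"], *metadata["subjects"]]).lower()
    for keyword, folder in rules.items():
        if keyword.lower() in haystack:
            return folder
    return default
```

With a hypothetical mapping such as `rules = {"python": "Programming", "cooking": "Food"}`, calling `choose_folder(read_metadata("book.epub"), rules)` returns the target folder name, and the program can then move the file there and create links in any further matching folders. If a single dependency for both formats is preferred, PyMuPDF also exposes a `metadata` dictionary on opened documents, which may be enough on its own.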