Tags: python-3.x, file, pandoc, text-extraction, read-write

Reading and writing non-.txt files with Python


Please help me figure out the most suitable solution to the following problem.

I am writing an app that replaces words in a text, converting one spelling to another; for example, replacing every occurrence of “colour” with “color”.

The code already does this, but only by reading from and writing to .txt files, which Python handles natively.
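For context, the .txt-only version described above might look roughly like the following. This is only a sketch, not the actual app code: the names REPLACEMENTS, replace_words and convert_txt are made up for illustration, and the whole-word, case-preserving matching is an assumption about the desired behaviour.

```python
import re
from pathlib import Path

# Hypothetical mapping of spellings to replace; keys must be lowercase.
REPLACEMENTS = {"colour": "color"}


def _match_case(original, replacement):
    """Carry simple capitalisation over from the matched word."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement


def replace_words(text, replacements):
    """Replace whole words only, ignoring case when matching."""
    # \b keeps "colour" from matching inside "colourful".
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, replacements)) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: _match_case(m.group(0), replacements[m.group(0).lower()]),
        text,
    )


def convert_txt(src, dst):
    """Read a UTF-8 .txt file, apply the replacements, write the result."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(replace_words(text, REPLACEMENTS), encoding="utf-8")
```

Calling convert_txt("in.txt", "out.txt") would then rewrite one file; the file names here are placeholders.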

However, I would like it to read from (and eventually write to) other text formats. So I went looking for solutions and found two: textract and pandoc. Textract requires a long list of libraries and programs to be installed beforehand.

Pandoc only requires installing Pandoc itself before running pip install pypandoc and using it, which looked nicer. (The Pandoc installation guide notes that if you also want to be able to write PDF, you will need to install LaTeX.)
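To make the dependency concrete: pypandoc is a wrapper that calls an external pandoc executable, so the code can only work where that executable is available. Here is a minimal sketch of reading a document as plain text through it. The helper names pandoc_available and read_as_plain_text are hypothetical, and the check assumes pandoc is expected on the PATH.

```python
import shutil


def pandoc_available():
    """Return True if a pandoc executable can be found on the PATH."""
    return shutil.which("pandoc") is not None


def read_as_plain_text(path):
    """Convert a document (e.g. .docx, .odt, .epub) to plain text via pandoc."""
    if not pandoc_available():
        raise RuntimeError("pandoc is not installed or not on the PATH")
    # Imported lazily so the module still loads when pypandoc is missing.
    import pypandoc

    # "plain" is pandoc's plain-text output format.
    return pypandoc.convert_file(path, "plain")
```

A check like this at least lets the app show a clear message instead of failing with a traceback when Pandoc is missing.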

My aim is to build a platform-independent app. My questions are:

Will the app's user have to install Pandoc (and eventually LaTeX) on their machine to be able to use the app?

Would it be preferable (although very unprofessional, I suppose) to warn the user that the app only works with .txt files, so any text must be copied and pasted into one?


Solution

  • A few "moons" later I have an answer to my own question, so I am sharing it. (Is that not the reason we all come here?) The code for the app is complete and working, and I actually ended up using neither textract nor pandoc. Here is a list of the modules I used instead: PyPDF, docx, ezodf, beautifulsoup, ebooklib, and some others as auxiliaries.

    I am not pleased about having so many imports. Someone told me that I could do all of this using just the NLTK library. Can anyone confirm this before I commit to studying it? Thank you.
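One way to keep a per-format approach like the one in this answer manageable is a dispatch table keyed by file extension, with each third-party module imported only when its format is actually opened. The sketch below is an illustration, not the answer's actual code: the reader functions and the READERS table are hypothetical names, only .txt, .docx and HTML are shown, and it assumes the docx module comes from the python-docx package and beautifulsoup from bs4.

```python
from pathlib import Path


def read_txt(path):
    return Path(path).read_text(encoding="utf-8")


def read_docx(path):
    # Third-party: python-docx. Imported lazily, only for .docx files.
    from docx import Document

    return "\n".join(p.text for p in Document(path).paragraphs)


def read_html(path):
    # Third-party: beautifulsoup4. Imported lazily, only for HTML files.
    from bs4 import BeautifulSoup

    html = Path(path).read_text(encoding="utf-8")
    return BeautifulSoup(html, "html.parser").get_text()


# Hypothetical table mapping file extensions to reader functions.
READERS = {
    ".txt": read_txt,
    ".docx": read_docx,
    ".html": read_html,
    ".htm": read_html,
}


def read_text(path):
    """Pick a reader from the file extension and return the file's text."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
    return reader(path)
```

With this layout, supporting another format means adding one function and one table entry, and a user who only opens .txt files never needs the extra packages installed.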