
How to fix broken-up text with python-docx to get free-flowing text for ebooks?


I'm trying to edit a free Ebook I found online into easily readable text for Kindle, with headers and full paragraphs.

I'm very new to Python and coding in general, so I haven't made any real progress.

Each line ends with a hard break (Enter), so python-docx treats every line as a separate paragraph.

Basically, what needs to be done is to delete the extra spaces and breaks between the lines so the text doesn't break when converted into MOBI or EPUB.

The text looks like this:

Unformatted: [image]

And should look like this:

Formatted: [image]

Any help is welcome!


Solution

  • I used the python-docx library, which is not installed by default. You can install it with pip or conda:

    pip install python-docx
    conda install python-docx --channel conda-forge
    

    After installing:

    from docx import Document

    doc = Document(r'path\to\file\pride_and_prejudice.docx')

    # Collect the text of every paragraph (each broken line is its own paragraph)
    all_text = [para.text for para in doc.paragraphs]
    
    # Join with a space so the last word of one line does not fuse with the first word of the next
    all_text_str = ' '.join(all_text)
    
    clean_text = all_text_str.replace('\n', ' ')   # Replace line breaks inside a paragraph with a space
    clean_text = ' '.join(clean_text.split())      # Collapse runs of whitespace into a single space
    
    # Write everything into a single paragraph of a new document
    document = Document()
    p = document.add_paragraph(clean_text)
    document.save(r'path\to\file\pride_and_prejudice_clean.docx')
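
The code above puts the whole book into one paragraph. If the original paragraph boundaries should survive, one possible variation is sketched below. It rests on an assumption the question does not confirm: that an empty line in the source marks the end of a real paragraph. The names `normalize`, `merge_lines` and `rebuild_docx` are hypothetical helpers made up for this sketch, not part of python-docx.

```python
import re


def normalize(text):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r'\s+', ' ', text).strip()


def merge_lines(lines):
    """Group consecutive non-empty lines into paragraphs.

    Assumption: an empty line marks the end of a real paragraph.
    """
    paragraphs, current = [], []
    for line in lines:
        line = normalize(line)
        if line:
            current.append(line)
        elif current:
            # Empty line: close the paragraph collected so far
            paragraphs.append(' '.join(current))
            current = []
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs


def rebuild_docx(src_path, dst_path):
    """Hypothetical helper: read src_path, merge its lines, write dst_path."""
    # Imported here so merge_lines can be tried without python-docx installed
    from docx import Document

    source = Document(src_path)
    target = Document()
    for text in merge_lines(p.text for p in source.paragraphs):
        target.add_paragraph(text)
    target.save(dst_path)
```

Calling `rebuild_docx(r'path\to\file\pride_and_prejudice.docx', r'path\to\file\pride_and_prejudice_clean.docx')` would then write one paragraph per group of lines. If the source file has no empty lines between paragraphs, this collapses to a single paragraph just like the code above, and a different rule for spotting paragraph ends would be needed. Chapter titles could be written with `target.add_heading(text, level=1)` instead of `add_paragraph`, but how to recognise a title depends on the file.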