extract text from pptx, ppt, docx, doc and msg files python windows

Is there a way to extract text from pptx, ppt, docx, doc and msg files on a Windows machine? I have a few hundred of these files and need a programmatic way to do it. I would prefer Python, but I am open to other suggestions.

I searched online and found some discussions, but they only applied to Linux machines.

Solution

Word

For Word files I tried python-docx; to install it, run pip install python-docx. Note that it reads .docx files only, not the older .doc format. I had a Word document called example.docx with 4 lines of text, and all of them were extracted correctly, as you can see in the output below.

from docx import Document

d = Document("example.docx")

for par in d.paragraphs:
    print(par.text)

output (the example.docx content):

Titolo
Paragrafo 1 a titolo di esempio
This is an example of text
This is the final part, just 4 rows
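
One thing to be aware of: doc.paragraphs only returns the paragraphs of the document body, so text that sits inside tables is not included. Below is a minimal sketch for the case where you also want the table text. The helper name iter_docx_text is made up for illustration, and it expects an already opened Document object.

```python
def iter_docx_text(doc):
    """Yield body paragraph text, then the text of every table cell.

    `doc` is expected to be a python-docx Document (anything exposing
    .paragraphs and .tables in the same shape works too).
    """
    # Paragraphs of the document body
    for par in doc.paragraphs:
        yield par.text
    # Text stored in tables, cell by cell
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                yield cell.text


# Usage (requires python-docx):
#     from docx import Document
#     print("\n".join(iter_docx_text(Document("example.docx"))))
```

In this sketch the table text comes after all the paragraphs rather than in its original position in the document, and merged cells may produce the same text more than once.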

Join the text of all the docx files in a folder

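import os
from docx import Document

# Pick up every .docx file in the current folder
# (endswith avoids names that merely contain ".docx", e.g. "a.docx.bak")
files = [f for f in os.listdir() if f.endswith(".docx")]
text_collector = []
for f in files:
    doc = Document(f)
    # Collect the text of every paragraph in this document
    for par in doc.paragraphs:
        text_collector.append(par.text)

# Join everything, one paragraph per line
whole_text = "\n".join(text_collector)

print(whole_text)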
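
With a few hundred files, the listing step may need a bit more care: the files may live in another folder, the extension may be upper case, and Word leaves temporary lock files whose names start with ~$ next to any document that is currently open. Here is a sketch of a stricter listing using only the standard library. The name find_docx is a hypothetical helper, not part of python-docx.

```python
from pathlib import Path


def find_docx(folder="."):
    """Return the .docx files in `folder`, sorted by path.

    Skips Word's "~$..." lock files, which appear while a document is
    open and cannot be parsed as documents.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and p.suffix.lower() == ".docx"
        and not p.name.startswith("~$")
    )


# Usage (requires python-docx):
#     from docx import Document
#     for path in find_docx("."):
#         doc = Document(str(path))
```

The function returns Path objects, so str(path) is passed to Document in the usage comment.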

As above, but with a choice

This code lists the docx files in the folder and asks you to choose which ones you want to join.

import os
from docx import Document

# List the .docx files in the current folder
files = [f for f in os.listdir() if f.endswith(".docx")]

# Show a numbered menu, starting from 1
for n, f in enumerate(files, start=1):
    print(n, f)
print()
print("Write the numbers of the files you need, separated by spaces")
inp = input("Which files do you want to join? ")

# Convert the answer to 1-based numbers and pick the matching files
desired = [int(x) for x in inp.split()]
list_to_join = [files[n - 1] for n in desired]

text_collector = []
for f in list_to_join:
    doc = Document(f)
    # Collect the text of every paragraph in this document
    for par in doc.paragraphs:
        text_collector.append(par.text)

# Join everything, one paragraph per line
whole_text = "\n".join(text_collector)

print(whole_text)
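
The selection step above is fragile: a typo raises ValueError, a number that is too large raises IndexError, and entering 0 silently picks the last file because files[-1] is valid Python. A minimal sketch of a more forgiving parser follows. The name parse_selection is a hypothetical helper, and silently ignoring bad input is just one possible policy.

```python
def parse_selection(answer, n_files):
    """Turn an answer like "1 3 2" into zero-based indexes [0, 2, 1].

    Tokens that are not integers, or that fall outside 1..n_files,
    are ignored; duplicates are kept only once, in first-seen order.
    """
    picked = []
    # Accept commas as well as spaces between the numbers
    for token in answer.replace(",", " ").split():
        try:
            index = int(token) - 1
        except ValueError:
            continue  # not a number, skip it
        if 0 <= index < n_files and index not in picked:
            picked.append(index)
    return picked


# Usage with the menu above:
#     list_to_join = [files[i] for i in parse_selection(inp, len(files))]
```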