
Extract text from pptx, ppt, docx, doc and msg files with Python on Windows


Is there a way to extract text from pptx, ppt, docx, doc and msg files on a Windows machine? I have a few hundred of these files and need a programmatic way to process them. I would prefer Python, but I am open to other suggestions.

I searched online and found some discussions, but they only applied to Linux machines.


Solution

  • Word

    For Word files I tried python-docx; to install it, run pip install python-docx. I had a Word document called example.docx with 4 lines of text in it, and they were extracted correctly, as you can see in the output below.

    from docx import Document
    
    d = Document("example.docx")
    
    for par in d.paragraphs:
        print(par.text)
    

    output (the example.docx content):

    Titolo
    Paragrafo 1 a titolo di esempio
    This is an example of text
    This is the final part, just 4 rows
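
    If installing python-docx is not an option, the same idea can be sketched with the standard library only, because a .docx file is a ZIP archive whose main text lives in word/document.xml. The function name docx_paragraphs is my own (hypothetical), and this is a minimal sketch rather than a replacement for python-docx: it ignores headers, footers and footnotes, and text boxes nested inside a paragraph may be counted twice. Unlike Document.paragraphs, it also returns the paragraphs found inside tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements of word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return a list with the text of each paragraph in the document body.

    Hypothetical helper: paragraphs inside tables are included too.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # The text of a paragraph is split across one or more <w:t> elements
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs
```

    Calling docx_paragraphs("example.docx") returns a list of strings, one per paragraph, that can be printed or joined just like par.text above.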
    

    Join the text of all the docx files in a folder

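    import os
    from docx import Document
    
    # Keep only real .docx files; skip the "~$..." lock files Word creates while a document is open
    files = [f for f in os.listdir() if f.lower().endswith(".docx") and not f.startswith("~$")]
    
    # Collect the text of every paragraph of every file
    text_collector = []
    for f in files:
        doc = Document(f)
        for par in doc.paragraphs:
            text_collector.append(par.text)
    
    # One paragraph per line
    whole_text = "\n".join(text_collector)
    
    print(whole_text)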
    

    As above, but choosing the files

    This code lists the docx files in the folder and asks you to choose which ones to join.

    import os
    from docx import Document
    
    # Keep only real .docx files; skip the "~$..." lock files Word creates while a document is open
    files = [f for f in os.listdir() if f.lower().endswith(".docx") and not f.startswith("~$")]
    
    # Show a numbered menu, starting from 1
    for n, f in enumerate(files, start=1):
        print(n, f)
    print()
    print("Write the numbers of the files you need, separated by spaces")
    inp = input("Which files do you want to join? ")
    
    # Turn the answer (e.g. "1 3") into the chosen file names
    list_to_join = [files[int(n) - 1] for n in inp.split()]
    
    # Collect the text of every paragraph of the chosen files
    text_collector = []
    for f in list_to_join:
        doc = Document(f)
        for par in doc.paragraphs:
            text_collector.append(par.text)
    
    # One paragraph per line
    whole_text = "\n".join(text_collector)
    
    print(whole_text)
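
  • PowerPoint (.pptx)

    I have not tested this on real presentations as thoroughly as the Word code, so take it as a rough sketch. A .pptx file is also a ZIP archive, with one XML file per slide under ppt/slides/, so the text can be read with the standard library only. The function name pptx_slide_texts is hypothetical, and the slides are sorted by file number, which usually matches the presentation order but is not guaranteed to. Legacy .ppt and .doc files use a different binary format and are not handled here.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used by the text elements inside each slide
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def pptx_slide_texts(path):
    """Return a list with one string per slide (hypothetical helper)."""
    with zipfile.ZipFile(path) as archive:
        # Slides are stored as ppt/slides/slide1.xml, slide2.xml, ...
        slide_names = [
            name for name in archive.namelist()
            if re.fullmatch(r"ppt/slides/slide\d+\.xml", name)
        ]
        # Sort numerically so that slide10 comes after slide2
        slide_names.sort(key=lambda name: int(re.search(r"\d+", name).group()))

        texts = []
        for name in slide_names:
            root = ET.fromstring(archive.read(name))
            # Each <a:p> is a paragraph whose text is split across <a:t> elements
            lines = [
                "".join(t.text or "" for t in p.iter(A_NS + "t"))
                for p in root.iter(A_NS + "p")
            ]
            # Drop empty paragraphs and keep one paragraph per line
            texts.append("\n".join(line for line in lines if line))
    return texts
```

    To process a whole folder, the function can be called on every file whose name ends with .pptx, in the same way the docx loop above calls Document(f).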