python text-mining python-docx docx2txt

Python get multiple docx file names and extract specific words from the files to generate a dataframe or table


I want to read multiple Word documents (.docx files) in a folder and search each one for a specific word, e.g. "laptop", then generate a table or a DataFrame from the results. For instance, my folder contains file_1.docx, file_2.docx ... file_n.docx, and each file may or may not contain the word "laptop". In the end I hope to generate a table like:

FileName          Keyword
file_1.docx       "laptop"
file_2.docx       "laptop"
...

Solution

  • If you are using Python 3.x, you will need to install python-docx:

    pip install python-docx

    Not to be confused with the docx package, which I had some issues using.

    import os
    from docx import Document
    import pandas as pd
    
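    match_word = "laptop"   # word to look for (the match is case-sensitive)
    match_items = []        # collects [file path, matched word] rows
    folder = 'C:\\Dev\\Docs'
    # Keep only the .docx files and turn each name into a full path
    file_names = os.listdir(folder)
    file_names = [file for file in file_names if file.endswith('.docx')]
    file_names = [os.path.join(folder, file) for file in file_names]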
    
    for file in file_names:
        document = Document(file)
        for paragraph in document.paragraphs:
            if match_word in paragraph.text:
                match_items.append([file, match_word])
                break  # one row per file is enough; skip the remaining paragraphs
    
    the_df = pd.DataFrame(
        match_items,
        columns=['file_name', 'word_match'],
        index=[i[0] for i in match_items]
    )
    
    print(the_df)
    

    Output:

    file_name              word_match
    C:\Dev\Docs\c.docx     laptop
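
The code above does a case-sensitive substring test on body paragraphs only, so it would miss "Laptop" and any text inside tables. The following is a minimal sketch of one possible variation, not part of the original answer: it swaps in a case-insensitive whole-word match using the standard `re` module and also reads table cells through python-docx. The helper names `contains_word`, `docx_text_chunks` and `find_files_with_word`, and the `read_chunks` parameter, are hypothetical names chosen for illustration. Headers, footers and nested tables are not covered.

```python
import os
import re


def contains_word(text, word):
    """Return True if `word` occurs in `text` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def docx_text_chunks(path):
    """Yield the text of body paragraphs and top-level table cells of one .docx file."""
    # Imported lazily so the matching helpers work without python-docx installed.
    from docx import Document  # provided by the python-docx package

    document = Document(path)
    for paragraph in document.paragraphs:
        yield paragraph.text
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                yield cell.text


def find_files_with_word(folder, word, read_chunks=docx_text_chunks):
    """Return [file_name, word] rows, one per .docx file that mentions `word`."""
    rows = []
    for name in sorted(os.listdir(folder)):
        # Skip non-docx files and the "~$" lock files Word creates while a document is open.
        if not name.lower().endswith(".docx") or name.startswith("~$"):
            continue
        path = os.path.join(folder, name)
        # any() stops at the first matching chunk, so each file yields at most one row.
        if any(contains_word(chunk, word) for chunk in read_chunks(path)):
            rows.append([name, word])
    return rows
```

Assuming the same folder as above, `find_files_with_word('C:\\Dev\\Docs', 'laptop')` returns a list of rows that can be passed to `pd.DataFrame(rows, columns=['FileName', 'Keyword'])` to get the table layout asked for in the question. Note that a whole-word match will not count "laptops" as a hit; drop the `\b` anchors if substring matching is preferred.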