I want to read multiple Word documents (.docx files) in a folder and search each one for a specific word, e.g. "laptop", to generate a table or a DataFrame. For instance, my folder contains file_1.docx, file_2.docx ... file_n.docx, and each file may or may not contain the word "Laptop". In the end I hope to generate a table like:
FileName Keyword
file_1.docx "laptop"
file_2.docx "laptop"
...
If you are using Python 3.x, you will need to install python-docx:
pip install python-docx
This is not to be confused with the docx package, which I had some issues using.
import os
from docx import Document
import pandas as pd

match_word = "laptop"
match_items = []
folder = 'C:\\Dev\\Docs'

# Collect the full paths of all .docx files in the folder
file_names = os.listdir(folder)
file_names = [file for file in file_names if file.endswith('.docx')]
file_names = [os.path.join(folder, file) for file in file_names]

for file in file_names:
    document = Document(file)
    for paragraph in document.paragraphs:
        if match_word in paragraph.text:
            match_items.append([file, match_word])
            break  # one row per file is enough, so stop at the first match

the_df = pd.DataFrame(
    match_items,
    columns=['file_name', 'word_match'],
    index=[i[0] for i in match_items]
)
print(the_df)
Output:
file_name word_match
C:\Dev\Docs\c.docx laptop
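Note that the `in` check above is case-sensitive, so a file containing only "Laptop" would not match "laptop". If that matters for your documents, here is a minimal sketch of a case-insensitive variant. The helper names `contains_word` and `find_matching_files` are my own, not part of python-docx, and like the code above it only looks at body paragraphs.

```python
import os


def contains_word(texts, word):
    """Return True if any string in `texts` contains `word`, ignoring case."""
    needle = word.lower()
    return any(needle in text.lower() for text in texts)


def find_matching_files(folder, word):
    """Return a [file_name, word] row for each .docx in `folder` containing `word`."""
    # python-docx is imported here so contains_word() can be used without it.
    from docx import Document

    rows = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.docx'):
            continue
        document = Document(os.path.join(folder, name))
        # Stops at the first matching paragraph, so each file yields one row at most.
        if contains_word((p.text for p in document.paragraphs), word):
            rows.append([name, word])
    return rows
```

The rows can then be passed to pandas the same way as before, e.g. `pd.DataFrame(find_matching_files('C:\\Dev\\Docs', 'laptop'), columns=['file_name', 'word_match'])`.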