01 事務用品・機器
大阪府警察大正警察署:指サック等の購入 :大阪市大正区
01 事務用品・機器
府立学校大阪わかば高等学校:校内衛生用品7件 ★ :大阪市生野区
01 事務用品・機器
府立学校工芸高等学校:イレパネ 他 購入 :大阪市阿倍野区
I want to search for mails that match a parent keyword and a list of child keywords, which I have configured in a JSON config file like the one below.
{
    "folder_name": "調達プロジェクト",
    "output_file_path": "E:\\output",
    "output_file_name": "output.txt",
    "parent_keyword": "meeting",
    "child_keywords": ["土木一式工事", "産業用機器", "事務用品・機器"]
}
Now I am trying to find the mails that contain these parent and child keywords and write a text file with each matched keyword and the information (linked text and URLs) associated with it. For example, if the keyword 通信用機器 matches in the mail above, I need to extract the text and URLs below or associated with that keyword (and likewise for the rest of the matched keywords), like this:
keyword: matched keyword
Paragraph text: text associated with the keyword
Urls: urls associated with the keyword
Here is what I tried in Python:
import win32com.client
import os
import json
import logging
import re

def read_config(config_file):
    with open(config_file, 'r', encoding="utf-8") as f:
        config = json.load(f)
    return config


def search_and_save_email(config):
    try:
        folder_name = config.get("folder_name", "")
        output_file_path = config.get("output_file_path", "")
        parent_keyword = config.get("parent_keyword", "")
        child_keywords = config.get("child_keywords", [])
        # Ensure the directory exists
        os.makedirs(output_file_path, exist_ok=True)
        outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
        inbox = outlook.GetDefaultFolder(6)
        # Find the user-created folder within the Inbox
        user_folder = None
        for folder in inbox.Folders:
            if folder.Name == folder_name:
                user_folder = folder
        if user_folder is not None:
            # Search for emails with the parent keyword anywhere in the subject
            parent_keyword_pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, parent_keyword.split())) + r')\b', re.IGNORECASE)
            for item in user_folder.Items:
                if parent_keyword_pattern.findall(item.Subject):
                    logging.info(f"Found parent keyword in Subject: {item.Subject}")
                    # Parent keyword found, now search for child keywords in the body
                    body_lower = item.Body.lower()
                    # Initialize output_text outside the child keywords loop
                    output_text = ""
                    for child_keyword in child_keywords:
                        # Search for child keyword in the body using regular expression
                        child_keyword_pattern = re.compile(re.escape(child_keyword), re.IGNORECASE)
                        matches = child_keyword_pattern.finditer(body_lower)
                        for match in matches:
                            logging.info(f"Found child keyword '{child_keyword}' at position {match.start()}-{match.end()}")
                            # Extract the paragraph around the matched position
                            paragraph_start = body_lower.rfind('\n', 0, match.start())
                            paragraph_end = body_lower.find('\n', match.end())
                            paragraph_text = item.Body[paragraph_start + 1:paragraph_end]
                            # Extract URLs from the paragraph using a simple pattern
                            url_pattern = re.compile(r'http[s]?://\S+')
                            urls = url_pattern.findall(paragraph_text)
                            # Append the results to the output_text
                            output_text += f"Child Keyword: {child_keyword}\n"
                            output_text += f"Paragraph Text: {paragraph_text}\n"
                            output_text += f"URLs: {', '.join(urls)}\n\n"
                    if output_text:
                        # Save the result to a text file
                        output_file = os.path.join(output_file_path, f"{item.Subject.replace(' ', '_')}.txt")
                        with open(output_file, 'w', encoding='utf-8') as f:
                            f.write(output_text)
                        logging.info(f"Saved results to {output_file}")
                    else:
                        logging.warning(f"Child keywords not found in folder '{folder_name}'.")
        else:
            logging.warning(f"Folder '{folder_name}' not found.")
    except Exception as e:
        logging.error(f"An error occurred: {str(e)}")


if __name__ == "__main__":
    # Set up logging
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    # Specify the path to the configuration file
    config_file_path = "E:\\config2.json"
    # Read configuration from the file
    config = read_config(config_file_path)
    # Search and save email based on the configuration
    search_and_save_email(config)

Unfortunately, the code only gives me the matched keyword, not the text and URLs associated with it. My output text file looks like this:
Child Keyword: 土木一式工事
Paragraph Text: 土木一式工事
Child Keyword: 産業用機器
Paragraph Text: 19 産業用機器
Child Keyword: 産業用機器
Paragraph Text: 19 産業用機器
I am pretty sure the problem lies in the logic and the regular expression, which I am still trying to work out, but I need some help. Sorry for the long post.
The problem is that you're looking for a URL in paragraph_text, which is the text enclosed by the newlines closest to the child_keyword you find. That is only text like 01 事務用品・機器, without a URL.
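To see this concretely, here is a minimal sketch that repeats the slicing from the question on a made-up body. The body layout (heading, description, URL on separate lines) and the example.com URL are assumptions for illustration, not data taken from the actual mail.

```python
import re

# Hypothetical mail body: the URL sits two lines below the keyword line.
body = (
    "01 事務用品・機器\n"
    "大阪府警察大正警察署:指サック等の購入 :大阪市大正区\n"
    "https://example.com/item/1\n"
)

match = re.search(re.escape("事務用品・機器"), body)

# Same slicing as in the question: previous newline up to the next newline.
# rfind returns -1 when there is no earlier newline, so start + 1 becomes 0.
start = body.rfind("\n", 0, match.start())
end = body.find("\n", match.end())
paragraph_text = body[start + 1:end]

# The URL search only ever sees the keyword's own line.
urls = re.findall(r"https?://\S+", paragraph_text)

print(repr(paragraph_text))
print(urls)
```

The slice stops at the end of the keyword's own line, so the URL two lines further down is never part of the text being searched.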
You can instead use a regex pattern that captures child keywords in an alternation pattern, the paragraph text enclosing the keyword, and the URL that follows:
rf'([^\n]*({"|".join(map(re.escape, child_keywords))})[^\n]*).*?\b(https?://\S+)'
so that:
for paragraph_text, child_keyword, url in re.findall(
    rf'([^\n]*({"|".join(map(re.escape, child_keywords))})[^\n]*).*?\b(https?://\S+)',
    item.Body,
    flags=re.DOTALL,  # needed so that .*? can cross newlines to reach the URL
):
    print(f'{paragraph_text=}', f'{child_keyword=}', f'{url=}', sep='\n')
paragraph_text='01 事務用品・機器'
paragraph_text='01 事務用品・機器'
paragraph_text='01 事務用品・機器'
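
For trying the pattern without Outlook, the following self-contained sketch may help. The helper names extract_matches and format_record, the sample body, and the example.com URLs are hypothetical and only serve the illustration; the output layout follows the format requested in the question.

```python
import re


def extract_matches(body, child_keywords):
    """Return (paragraph_text, child_keyword, url) tuples found in body."""
    alternation = "|".join(map(re.escape, child_keywords))
    pattern = re.compile(
        rf"([^\n]*({alternation})[^\n]*).*?\b(https?://\S+)",
        re.DOTALL,  # let .*? run across line breaks to reach the URL
    )
    return pattern.findall(body)


def format_record(paragraph_text, child_keyword, url):
    """Render one match in the layout asked for in the question."""
    return (
        f"keyword: {child_keyword}\n"
        f"Paragraph text: {paragraph_text}\n"
        f"Urls: {url}\n"
    )


# Hypothetical body: two entries under the same category heading.
body = (
    "01 事務用品・機器\n"
    "大阪府警察大正警察署:指サック等の購入 :大阪市大正区\n"
    "https://example.com/item/1\n"
    "01 事務用品・機器\n"
    "府立学校工芸高等学校:イレパネ 他 購入 :大阪市阿倍野区\n"
    "https://example.com/item/2\n"
)
child_keywords = ["土木一式工事", "産業用機器", "事務用品・機器"]

results = extract_matches(body, child_keywords)
report = "\n".join(format_record(*record) for record in results)
print(report)
```

One caveat with this approach: the lazy .*? keeps going until it finds some URL, so a keyword line that has no URL of its own would be paired with the next URL further down in the body. If that can happen in your mails, splitting the body into per-heading blocks first is probably safer.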