01 事務用品・機器
大阪府警察大正警察署:指サック等の購入 :大阪市大正区
https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350042214
01 事務用品・機器
府立学校大阪わかば高等学校:校内衛生用品7件 ★ :大阪市生野区
https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350041978
01 事務用品・機器
府立学校工芸高等学校:イレパネ 他 購入 :大阪市阿倍野区
https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350042117
I want to search for emails that match a parent keyword and a list of child keywords, which I have configured in a JSON config file like the one below.
{
    "folder_name": "調達プロジェクト",
    "output_file_path": "E:\\output",
    "output_file_name": "output.txt",
    "parent_keyword": "meeting",
    "child_keywords": ["土木一式工事", "産業用機器", "事務用品・機器"]
}
Now I am trying to find the mails that have these parent and child keywords, and I want to create a text file with each matched keyword and the information (linked text and URLs) associated with it. For example, for the mail above, if the keyword 通信用機器 matched, I have to extract the text and URLs that follow or are associated with this keyword (and with every other matched keyword), like below.
keyword: matched keyword
Paragraph text: text associated with the keyword
Urls: urls associated with the keyword
Here is what I tried with Python.
import win32com.client
import os
import json
import logging
import re


def read_config(config_file):
    with open(config_file, 'r', encoding="utf-8") as f:
        config = json.load(f)
    return config


def search_and_save_email(config):
    try:
        folder_name = config.get("folder_name", "")
        output_file_path = config.get("output_file_path", "")
        parent_keyword = config.get("parent_keyword", "")
        child_keywords = config.get("child_keywords", [])
        # Ensure the directory exists
        os.makedirs(output_file_path, exist_ok=True)
        outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
        inbox = outlook.GetDefaultFolder(6)
        # Find the user-created folder within the Inbox
        user_folder = None
        for folder in inbox.Folders:
            if folder.Name == folder_name:
                user_folder = folder
                break
        if user_folder is not None:
            # Search for emails with the parent keyword anywhere in the subject
            parent_keyword_pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, parent_keyword.split())) + r')\b', re.IGNORECASE)
            for item in user_folder.Items:
                if parent_keyword_pattern.findall(item.Subject):
                    logging.info(f"Found parent keyword in Subject: {item.Subject}")
                    # Parent keyword found, now search for child keywords in the body
                    body_lower = item.Body.lower()
                    # Initialize output_text outside the child keywords loop
                    output_text = ""
                    for child_keyword in child_keywords:
                        # Search for child keyword in the body using regular expression
                        child_keyword_pattern = re.compile(re.escape(child_keyword), re.IGNORECASE)
                        matches = child_keyword_pattern.finditer(body_lower)
                        for match in matches:
                            logging.info(f"Found child keyword '{child_keyword}' at position {match.start()}-{match.end()}")
                            # Extract the paragraph around the matched position
                            paragraph_start = body_lower.rfind('\n', 0, match.start())
                            paragraph_end = body_lower.find('\n', match.end())
                            paragraph_text = item.Body[paragraph_start + 1:paragraph_end]
                            # Extract URLs from the paragraph using a simple pattern
                            url_pattern = re.compile(r'http[s]?://\S+')
                            urls = url_pattern.findall(paragraph_text)
                            # Append the results to the output_text
                            output_text += f"Child Keyword: {child_keyword}\n"
                            output_text += f"Paragraph Text: {paragraph_text}\n"
                            output_text += f"URLs: {', '.join(urls)}\n\n"
                    # Save the result to a text file
                    output_file = os.path.join(output_file_path, f"{item.Subject.replace(' ', '_')}.txt")
                    with open(output_file, 'w', encoding='utf-8') as f:
                        f.write(output_text)
                    logging.info(f"Saved results to {output_file}")
                else:
                    logging.warning(f"Child keywords not found in folder '{folder_name}'.")
        else:
            logging.warning(f"Folder '{folder_name}' not found.")
    except Exception as e:
        logging.error(f"An error occurred: {str(e)}")


if __name__ == "__main__":
    # Set up logging
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    # Specify the path to the configuration file
    config_file_path = "E:\\config2.json"
    # Read configuration from the file
    config = read_config(config_file_path)
    # Search and save email based on the configuration
    search_and_save_email(config)
Unfortunately, the code only gives me the matched keyword, not the text and URLs associated with it. My output text file looks like this:
Child Keyword: 土木一式工事
Paragraph Text: 土木一式工事
URLs:
Child Keyword: 産業用機器
Paragraph Text: 19 産業用機器
URLs:
Child Keyword: 産業用機器
Paragraph Text: 19 産業用機器
URLs:
I am pretty sure the problem lies in the logic and the regular expression, which I am still trying to figure out, but I need some help. Sorry for the long post.
The problem is that you're looking for a URL in paragraph_text, which is the text enclosed by the newlines closest to the child_keyword you find. That is only text like 01 事務用品・機器, without a URL.
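To see this concretely, here is a minimal reproduction of the slicing logic from the question, run against one entry of the sample mail. The variable names body and keyword are made up for this illustration; the rfind/find calls mirror the question's code.

```python
import re

# One entry from the sample mail: heading line, description line, URL line.
body = (
    "01 事務用品・機器\n"
    "大阪府警察大正警察署:指サック等の購入 :大阪市大正区\n"
    "https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController"
    "?Shori=SmallKokokuInfo&open_kokoku=01202350042214\n"
)
keyword = "事務用品・機器"

match = re.search(re.escape(keyword), body)
# Same slicing as in the question: nearest newline before and after the match.
start = body.rfind("\n", 0, match.start())  # -1 here, keyword is on the first line
end = body.find("\n", match.end())
paragraph_text = body[start + 1:end]

# The slice never leaves the keyword's own line, so no URL can be found in it.
urls = re.findall(r"https?://\S+", paragraph_text)
print(repr(paragraph_text))
print(urls)
```

The slice stops at the end of the heading line, so the description and the URL on the following lines are never looked at.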
You can instead use a regex pattern that captures child keywords in an alternation pattern, the paragraph text enclosing the keyword, and the URL that follows:
rf'([^\n]*({"|".join(map(re.escape, child_keywords))})[^\n]*).*?\b(https?://\S+)'
so that:
for paragraph_text, child_keyword, url in re.findall(
    rf'([^\n]*({"|".join(map(re.escape, child_keywords))})[^\n]*).*?\b(https?://\S+)',
    body_lower,
    re.S
):
    print(f'{paragraph_text=}', f'{child_keyword=}', f'{url=}', sep='\n')
outputs:
paragraph_text='01 事務用品・機器'
child_keyword='事務用品・機器'
url='https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350042214'
paragraph_text='01 事務用品・機器'
child_keyword='事務用品・機器'
url='https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350041978'
paragraph_text='01 事務用品・機器'
child_keyword='事務用品・機器'
url='https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350042117'
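If you also need the description line that sits between the keyword and its URL (the "Paragraph text" in the desired output), a line-by-line pass may be easier to follow than a single regex. This is only a sketch, not a drop-in replacement: extract_entries is a hypothetical helper name, and it assumes the layout of the sample mail, where each entry is a heading line containing the keyword, followed by one or more text lines, and ends at its first URL line.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def extract_entries(body, child_keywords):
    """Group each keyword line with the text and URL lines that follow it.

    Assumption: an entry ends at the first line that contains a URL.
    """
    entries = []
    current = None
    for raw_line in body.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # A line containing a configured keyword starts a new entry.
        keyword = next((k for k in child_keywords if k in line), None)
        if keyword is not None:
            current = {"keyword": keyword, "text": [], "urls": []}
            entries.append(current)
            continue
        if current is None:
            continue  # line belongs to a category that is not configured
        urls = URL_RE.findall(line)
        if urls:
            current["urls"].extend(urls)
            current = None  # entry is complete
        else:
            current["text"].append(line)
    return entries


sample_body = (
    "01 事務用品・機器\n"
    "大阪府警察大正警察署:指サック等の購入 :大阪市大正区\n"
    "https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController"
    "?Shori=SmallKokokuInfo&open_kokoku=01202350042214\n"
    "01 事務用品・機器\n"
    "府立学校大阪わかば高等学校:校内衛生用品7件 ★ :大阪市生野区\n"
    "https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController"
    "?Shori=SmallKokokuInfo&open_kokoku=01202350041978\n"
)
entries = extract_entries(sample_body, ["土木一式工事", "産業用機器", "事務用品・機器"])
for entry in entries:
    print(f"keyword: {entry['keyword']}")
    print(f"Paragraph text: {' '.join(entry['text'])}")
    print(f"Urls: {', '.join(entry['urls'])}\n")
```

Note that a description line which happens to contain one of the child keywords would be treated as a new heading, so this only fits mails that follow the layout shown in the question.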