I have an Excel database with two columns, Cliche Phrases and Type. I need to check a text document for exact matches of each phrase and return the type of every phrase that matches. It would also be nice to show the matching phrases in a red font in the original document, but my main interest is identifying the type and writing it to a text file.
I can currently identify the cliche phrases, but the Excel part eludes me.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

cliches = ['Abandon ship',
           'About face',
           'Above board',
           'All ears']

# One token pattern per cliche, matched case-insensitively via LOWER
cliche_patterns = [[{'LOWER': token.text.lower()} for token in nlp(cliche)] for cliche in cliches]

matcher = Matcher(nlp.vocab)
for counter, pattern in enumerate(cliche_patterns):
    # spaCy v3 expects a list of patterns; the older form
    # matcher.add(key, None, pattern) was removed in v3
    matcher.add("Cliche " + str(counter), [pattern])

example_2 = nlp("We must abandon ship! It's the only way to stay above board.")
matches_2 = matcher(example_2)
for match in matches_2:
    # Each match is (match_id, start, end) in token offsets
    print(example_2[match[1]:match[2]])
MRE:
Mock text:
Two exquisite objection delighted deficient yet its contained. Cordial because are account evident its subject but eat. Can properly followed learning prepared you doubtful yet him. Over many our good lady feet ask that. Expenses own moderate day fat trifling stronger sir domestic feelings. you can’t judge a book by its cover Itself think outside the box at be answer always exeter up do. Though or my plenty uneasy do. Friendship so considered remarkably be to sentiments. Offered mention greater fifteen one promise because nor. Why can of worms denoting speaking fat indulged saw dwelling raillery.
Expected output:
Type A
Type B
I tried the code on the following mock text:
A sentence containing a phrase called a cat sat on the wall.
A sentence containing a phrase called think outside the box.
A sentence containing a phrase called loose cannon.
A sentence containing a phrase called can of worms.
Instead of
a
b
c
d
as output, I am only getting
b
c
d
I made some minor changes to the cliche-matching part of your code. This version writes the types of the matched cliches to a text file, without the coloring:
import spacy
from spacy.matcher import Matcher
from openpyxl import load_workbook

nlp = spacy.load('en_core_web_sm')

# Column A holds the phrase, column B its type; min_row=2 skips the first (header) row
wb = load_workbook('cliche_phrases.xlsx')
ws = wb.active
cliche_database = {row[0].lower(): row[1]
                   for row in ws.iter_rows(min_row=2, values_only=True)
                   if row[0]}  # ignore rows with an empty phrase cell

cliches = list(cliche_database.keys())
cliche_patterns = [[{'LOWER': token.text.lower()} for token in nlp(cliche)] for cliche in cliches]
matcher = Matcher(nlp.vocab)
matcher.add("Cliche", cliche_patterns)

# Read and process the mock text
with open('mock_text.txt', 'r') as file:
    text = file.read()
doc = nlp(text)
matches = matcher(doc)

cliche_types_output = []
for match_id, start, end in matches:
    # The dictionary keys are lowercased, so lowercase the matched text before the lookup
    cliche_phrase = doc[start:end].text.lower()
    cliche_type = cliche_database.get(cliche_phrase)
    if cliche_type:
        cliche_types_output.append(cliche_type)

# Write one type per line
output_filename = 'output.txt'
with open(output_filename, 'w') as output_file:
    output_file.write("\n".join(cliche_types_output))
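If the first phrase in the sheet never shows up in the output, one possible cause (a guess, since the workbook layout isn't shown) is that the sheet has no header row, so min_row=2 skips the first phrase. Stray whitespace or capitalisation in the phrase cells could also make the dictionary lookup miss. Below is a minimal sketch of a more forgiving lookup builder. The function name build_cliche_lookup and the has_header flag are hypothetical helpers of mine, not part of openpyxl, and the rows are mock data standing in for ws.iter_rows(values_only=True).

```python
def build_cliche_lookup(rows, has_header=True):
    """Map each stripped, lowercased phrase to its type.

    `rows` is any iterable of (phrase, type, ...) tuples, for example
    ws.iter_rows(values_only=True). `has_header` is a hypothetical flag
    for this sketch, not an openpyxl option.
    """
    rows = list(rows)
    if has_header:
        rows = rows[1:]  # drop the first row only when it really is a header
    lookup = {}
    for row in rows:
        phrase, cliche_type = row[0], row[1]
        if phrase is None:  # skip empty spreadsheet rows
            continue
        lookup[str(phrase).strip().lower()] = cliche_type
    return lookup


# Mock rows without a header (assumed layout: phrase, type)
rows = [
    ("a cat sat on the wall", "a"),
    ("Think outside the box", "b"),
    ("loose cannon ", "c"),
    ("can of worms", "d"),
]
lookup = build_cliche_lookup(rows, has_header=False)
print(lookup["a cat sat on the wall"])  # prints: a
```

With has_header=True the same rows would lose the first phrase, which matches the symptom of getting only b, c and d.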
I'll update the answer to include the coloring of matched words.
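For the coloring, a plain .txt file can't carry font color, so one option (an assumption on my part, not the only way) is to write an HTML copy of the document with the matched phrases wrapped in a red span. The sketch below is plain Python and works on character offsets. highlight_spans is a hypothetical helper name, and the sample text is a shortened piece of the mock text.

```python
import html


def highlight_spans(text, spans, color="red"):
    """Return `text` as HTML with each (start, end) character span coloured."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:  # skip a span that overlaps one already emitted
            continue
        pieces.append(html.escape(text[cursor:start]))
        pieces.append('<span style="color:{}">{}</span>'.format(
            color, html.escape(text[start:end])))
        cursor = end
    pieces.append(html.escape(text[cursor:]))  # remainder after the last span
    return "".join(pieces)


text = "Why can of worms denoting speaking."
start = text.index("can of worms")
spans = [(start, start + len("can of worms"))]
highlighted = highlight_spans(text, spans)
print(highlighted)
# Why <span style="color:red">can of worms</span> denoting speaking.
```

With the matcher code above, the spans could be collected as (doc[start:end].start_char, doc[start:end].end_char) for each match, and the returned string written to an .html file next to output.txt.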