python, nlp, spacy

spaCy phrase matcher: TypeError when trying to remove matched phrases


I am trying to clean text produced by automatic speech recognition (ASR) software using the spaCy phrase matcher (https://spacy.io/usage/rule-based-matching#phrasematcher). The data is very dirty and does not separate speakers, so I want to remove phrases that repeat across all data samples. With the rule-based phrase matcher I can find the target text in my sample strings, but when I try to replace the matches with spaces I get the following error:

TypeError: replace() argument 1 must be str, not spacy.tokens.token.Token

My code is below:

# Import the required libraries:
import spacy
from spacy.matcher import PhraseMatcher

# Declare a string from text extracted from a dataframe. Please note that there are many errors in the ASR, including words recognized incorrectly such as "mercado", which is a mis-translated utterance from the IVR.

conv_str = "Welcome to companyx, to continue in English, please press one but I contin into mercado. Hello, I am V companyx, virtual assistant to best serve you during our conversation. Please provide your responses after I finished speaking in a few words please tell me what you're calling about. You can say something like I want to change my account information"

# call the matcher
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Declare a list of strings to search for in another string

terms = ["Welcome to CompanyX", "to continue in English, please press one", "virtual assistant", "In a few words please tell me what you're calling about", "CompanyX"]
# the stack overflow interface is incorrectly coloring some of the term strings, but it works in python

# create patterns variable
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp(conv_str)
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end] # span is a list
    terms_not_needed = list(span)
    for item in terms_not_needed:
        conv_str.replace(item, ' ')

As mentioned, this raises the TypeError shown above. I understand that str.replace requires a string argument, but I thought that by converting the span to a list I could iterate through terms_not_needed and get individual string matches. Any guidance would be very helpful.


Solution

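  • There are a couple of issues with your approach here. The immediate cause of the error is that iterating over a span yields Token objects rather than strings, so you would need to pass item.text to replace. Also, str.replace returns a new string instead of modifying conv_str in place, so the result has to be assigned back. More fundamentally, if you're using replace there's no reason to use the PhraseMatcher at all, since replace already replaces every instance of a string.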

    What I would do instead is use an on_match callback to set a custom attribute, say token._.ignore, to True for anything your matcher finds. Then to get the tokens you're interested in you can just iterate over the Doc and take every token where that value isn't True.

    Here's a modified version of your code that does this:

    # Import the required libraries:
    import spacy
    from spacy.tokens import Token
    from spacy.matcher import PhraseMatcher
    
    Token.set_extension("ignore", default=False)
    
    # Declare a string from text extracted from a dataframe. Please note that there are many errors in the ASR, including words recognized incorrectly such as "mercado", which is a mis-translated utterance from the IVR.
    
    conv_str = "Welcome to companyx, to continue in English, please press one but I contin into mercado. Hello, I am V companyx, virtual assistant to best serve you during our conversation. Please provide your responses after I finished speaking in a few words please tell me what you're calling about. You can say something like I want to change my account information"
    
    nlp = spacy.blank("en")
    # call the matcher
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    
    
    def set_ignore(matcher, doc, i, matches):
        # spaCy calls this once per match; i is the index of the current match
        _, start, end = matches[i]
        for tok in doc[start:end]:
            tok._.ignore = True
    
    # Declare a list of strings to search for in another string
    
    terms = ["Welcome to CompanyX", "to continue in English, please press one", "virtual assistant", "In a few words please tell me what you're calling about", "CompanyX"]
    # the stack overflow interface is incorrectly coloring some of the term strings, but it works in python
    
    # create patterns variable
    patterns = [nlp.make_doc(text) for text in terms]
    matcher.add("TerminologyList", patterns, on_match=set_ignore)
    
    doc = nlp(conv_str)
    # this will run the callback
    matcher(doc)
    
    toks = [tok.text + tok.whitespace_ for tok in doc if not tok._.ignore]
    print("".join(toks))
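
The first point in the answer, that plain string replacement can do this job without the PhraseMatcher, can be sketched with the standard library alone. This is a minimal sketch, not part of the original answer: remove_phrases is a hypothetical helper name, and the regex approach is swapped in for str.replace so that matching is case-insensitive, roughly mirroring attr="LOWER". It works on characters rather than spaCy tokens, so it assumes the phrases appear in the text with the same spacing and punctuation as in the list.

```python
import re


def remove_phrases(text, phrases):
    """Remove every case-insensitive occurrence of each phrase from text.

    Hypothetical helper for illustration; it matches characters, not tokens.
    """
    # Try longer phrases first so "Welcome to CompanyX" wins over "CompanyX".
    ordered = sorted(phrases, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    # The lookarounds keep a phrase from matching inside a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    # sub() returns a new string; strings are never modified in place.
    cleaned = pattern.sub(" ", text)
    # Collapse the whitespace runs left behind by the removals.
    return re.sub(r"\s+", " ", cleaned).strip()


sample = (
    "Welcome to companyx, to continue in English, please press one "
    "but I contin into mercado."
)
terms = ["Welcome to CompanyX", "to continue in English, please press one", "CompanyX"]
print(remove_phrases(sample, terms))
```

Punctuation that sat between removed phrases is left behind, as it is in the token-based version above, so a further cleanup pass may still be needed. The token-based approach is likely the better fit when tokenization differences (for example spacing around punctuation) should not prevent a match.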