Tags: python, python-re

How can I simplify this method to replace punctuation while keeping special words intact?


I am writing a modular function that takes keywords containing special characters (@, &, *, %) and keeps them intact while deleting all other punctuation from a sentence. I have devised a solution, but it is bulky and probably more complicated than it needs to be. Is there a much simpler way to do this?

In short, my code finds the span of every occurrence of the special words, then finds the span of every punctuation character. It then loops over the character matches and removes only those that do not fall inside the span of one of the found words.

Code:

import re
from string import punctuation
from typing import List

sentence = "I am going to run over to Q&A and ask them a ton of questions about this & that & that & this while surfacing the internet! with my raccoon buddy @ the bar."

# my attempt to remove punctuation
class SentenceHolder:
    protected_words = ["Q&A"]

    def __init__(self, sentence):
        self.sentence = sentence

    def remove_punctuation(self):
        for punct in punctuation:
            # escape the symbol so regex metacharacters such as "." or "*" match literally
            symbol_matches: List[re.Match] = list(re.finditer(re.escape(punct), self.sentence))
            removable_matches = self._protected_word_overlap(symbol_matches)

            # replace each unprotected symbol with a space, working right to left
            for match in reversed(removable_matches):
                self.sentence = self.sentence[:match.start()] + " " + self.sentence[match.end():]

    def _protected_word_overlap(self, symbol_matches):
        # locate every occurrence of every protected word
        protected_word_locations = []
        for protected_word in self.protected_words:
            protected_word_locations.extend(re.finditer(re.escape(protected_word), self.sentence))

        # collect the symbols whose span overlaps a protected word
        protected_matches = []
        for protected_word in protected_word_locations:
            protected_word_set = set(range(protected_word.start(), protected_word.end()))
            for symbol_inst in symbol_matches:
                symbol_range = range(symbol_inst.start(), symbol_inst.end())
                if protected_word_set.intersection(symbol_range):
                    protected_matches.append(symbol_inst)

        # everything that is not protected may be removed
        return [sm for sm in symbol_matches if sm not in protected_matches]

Running the code:

my_string = SentenceHolder(sentence)
my_string.remove_punctuation()

Result:

"I am going to run over to Q&A and ask them a ton of questions about this  that   that   this while surfacing the internet  with my raccoon buddy   the bar"

I tried to use a regex pattern to identify all the locations of the punctuation, but the pattern I use in re.sub does not behave the same way in re.match.
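A likely cause of that mismatch, offered as a hedged aside: `re.match` only tries the pattern at the very start of the string, while `re.sub`, `re.search`, and `re.finditer` scan the whole string. The sketch below uses a shortened, made-up sample sentence and a hypothetical `punct_pattern` name to show the difference.

```python
import re
from string import punctuation

# Character class matching any single ASCII punctuation symbol
punct_pattern = re.compile("[" + re.escape(punctuation) + "]")

sample = "this & that @ the bar."  # shortened sample, not the full sentence

# re.match anchors at position 0, where the sample has a letter, so it finds nothing
match_at_start = punct_pattern.match(sample)

# finditer scans the whole string, the same way sub does
positions = [m.start() for m in punct_pattern.finditer(sample)]

# sub removes every occurrence found by that scan
stripped = punct_pattern.sub("", sample)
```

If the goal is to collect spans rather than replace text, `finditer` (or `search` for the first hit only) is the counterpart of `sub`, not `match`.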


Solution

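  • Probably not the best approach, but really simple: swap each protected word for a placeholder, remove the unwanted symbols, then swap the placeholders back.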

    protected = ["Q&A", "stack@exchange"]
    protected_dict = {f'protected{i}': p_word for i, p_word in enumerate(protected)}
    sentence = "I am going to run over to Q&A stack@exchange and ask them a ton of questions about this & that & that & this while surfacing the internet! with my raccoon buddy @ the bar."
    
    # protect
    for k, v in protected_dict.items():
        sentence = sentence.replace(v, k)
    
    # remove the unprotected symbols (only & and @ are handled here)
    sentence = sentence.replace('&', '')
    sentence = sentence.replace('@', '')
    
    # restore the protected words
    for k, v in protected_dict.items():
        sentence = sentence.replace(k, v)
    
    print(sentence)
    # I am going to run over to Q&A stack@exchange and ask them a ton of questions about this  that  that  this while surfacing the internet! with my raccoon buddy  the bar.
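As a possible alternative, not part of the answer above, the same idea can be sketched in a single `re.sub` pass: put the protected words first in an alternation so they are consumed whole, and let any remaining punctuation symbol fall through to the second branch. The function name `strip_punctuation` and its parameters are hypothetical, and the sketch assumes "punctuation" means `string.punctuation` (ASCII only).

```python
import re
from string import punctuation


def strip_punctuation(text, protected_words):
    """Remove ASCII punctuation from text, leaving protected words untouched."""
    punct_class = "[" + re.escape(punctuation) + "]"
    if not protected_words:
        # Nothing to protect: a plain substitution is enough
        return re.sub(punct_class, "", text)

    # Longest words first, so a protected word containing another one wins
    words = sorted(protected_words, key=len, reverse=True)
    protected_alt = "|".join(re.escape(w) for w in words)

    # Group 1 captures a protected word; otherwise one punctuation symbol matches
    pattern = re.compile(f"({protected_alt})|{punct_class}")

    # Keep group 1 when it matched, drop the symbol otherwise
    return pattern.sub(lambda m: m.group(1) or "", text)


print(strip_punctuation("Ask Q&A about this & that!", ["Q&A"]))
# Ask Q&A about this  that
```

Unlike the placeholder approach, this removes every symbol in `string.punctuation` (so "!" and "." go as well), and it cannot collide with placeholder text that happens to appear in the sentence. Returning `" "` instead of `""` in the lambda would mimic the space-padded output shown in the question.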