python regex parsing nlp quotes

Parsing: replacing quotes with tokens


I'm trying to parse a text file in Python to compute some statistics about it. To do so, I want to replace some punctuation marks with tokens. One example of such a token covers all the punctuation marks that end a sentence (., ! and ? become <EndS>). I managed to do this using a regex. Now I'm trying to parse quotes. For that, I think I need a way to distinguish opening quotes from closing quotes. I'm reading the input file line by line, and I have no guarantee that the quotes will be balanced.

For example:

 "Death to the traitors!" cried the exasperated burghers.
 "Go along with you," growled the officer, "you always cry the same thing over again. It is very tiresome."

should become something like:

 [Open] Death to the traitors! [Close] cried the exasperated burghers.
 [Open] Go along with you, [Close] growled the officer, [Open] you always cry the same thing over again. It is very tiresome. [Close]

Is it possible to do this using regexes? Is there an easier/better way to do this?


Solution

  • You can use the sub function from the re module, passing a function as the replacement:

    import re
    
    def replace_dbquote(match):
        # group(1) is the text between the two quotes, without the quotes
        return '[Open] ' + match.group(1) + ' [Close]'
    
    text = '"Death to the traitors!" cried the exasperated burghers. "Go along with you", growled the officer.'
    # Match an opening quote, any run of non-quote characters, then a closing quote
    result = re.sub(r'"([^"]*)"', replace_dbquote, text)
    
    print(result)
    

    https://docs.python.org/3.5/library/re.html#re.sub
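
    Note that a pattern of the form "[^"]*" only matches quotes that come in pairs, so a stray quote on a line is left untouched. Since the question says the quotes may not be balanced, one possible alternative is to toggle a flag on every quote character instead. The following is only a minimal sketch: tag_quotes and its parameters are hypothetical names, and resetting the state at the start of every line is an assumption that may not suit quotations spanning several lines.

```python
import re

def tag_quotes(line, open_tok="[Open]", close_tok="[Close]"):
    """Replace each straight double quote in one line with alternating tokens.

    Returns the tagged line and a flag telling whether a quotation is
    still open at the end of the line (odd number of quotes).
    Hypothetical helper, not part of the re module.
    """
    inside = False  # assumption: every line starts outside a quotation

    def toggle(_match):
        nonlocal inside
        inside = not inside
        # Entering a quotation emits the opening token, leaving emits the closing one
        return open_tok + " " if inside else " " + close_tok

    tagged = re.sub(r'"', toggle, line)
    return tagged, inside


if __name__ == "__main__":
    lines = [
        '"Death to the traitors!" cried the exasperated burghers.',
        '"Go along with you," growled the officer, "you always cry',
    ]
    for raw in lines:
        tagged, unclosed = tag_quotes(raw)
        print(tagged, "(unclosed quotation)" if unclosed else "")
```

    The returned flag lets the caller decide what to do with an unbalanced line, for example carrying the state over to the next line or appending a closing token.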