
Preprocess words that do not match list of words


I have a very specific case I'm trying to match: I have some text and a list of words (which may contain numbers, underscores, or ampersands), and I want to strip numeric characters (for instance) from the text unless they are part of a word in my list. The list is also long enough that I can't just write a regex that matches every one of the words.

I've tried to use regex to do this (i.e. something along the lines of re.sub(r'\d+', '', text), but with a more complex regex to match my case). This obviously isn't quite working, as I don't think regex is meant to handle that kind of case.

I'm trying to experiment with other options like pyparsing, and tried something like the below, but this also gives me an error (probably because I'm not understanding pyparsing correctly):

from pyparsing import *
import re

phrases = ["76", "tw3nty", "potato_man", "d&"]
text = "there was once a potato_man with tw3nty cars and d& 76 different homes"
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(lambda word: re.sub(r'\d+', '', word)))
parser.parseString(text)

What's the best way to approach this sort of matching, or are there other better suited libraries that would be worth a try?


Solution

  • You are very close to getting this pyparsing cleaner-upper working.

    Parse actions generally get their matched tokens as a list-like structure, a pyparsing-defined class called ParseResults.
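
    To make that concrete, here is a minimal sketch (not part of the original answer) of a parse action that only records what it is handed. The names seen and show_tokens are made up for illustration, and it assumes pyparsing is installed and that the camelCase API used in the question is available.

```python
from pyparsing import Word, alphanums

seen = []  # what the parse action was given, kept for inspection afterwards

def show_tokens(tokens):
    # tokens is a ParseResults (list-like), even for a single matched word
    seen.append((type(tokens).__name__, tokens[0]))
    # returning None leaves the matched tokens unchanged

result = Word(alphanums).setParseAction(show_tokens).parseString("there")
print(seen)             # [('ParseResults', 'there')]
print(result.asList())  # ['there']
```

    Indexing with [0] gives the matched string itself, which is what a function like re.sub expects.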

    You can see what actually gets sent to your parse action by wrapping it in the pyparsing decorator traceParseAction:

    parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(traceParseAction(lambda word: re.sub(r'\d+', '', word))))
    

    It is actually a little easier to read if you make your parse action a regular def'ed function instead of a lambda:

    @traceParseAction
    def unnumber(word):
        return re.sub(r'\d+', '', word)
    parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(unnumber))
    

    traceParseAction will report what is passed to the parse action and what is returned.

    >>entering unnumber(line: 'there was once a potato_man with tw3nty cars and d& 76 different homes', 0, ParseResults(['there'], {}))
    <<leaving unnumber (exception: expected string or bytes-like object)
    

    You can see that the value passed in is a list-like structure, so you should replace word in your call to re.sub with word[0]. (I also modified your input string to add some numbers to the unguarded words, to see the parse action in action.)

    text = "there was 1once a potato_man with tw3nty cars and d& 76 different99 homes"
    
    def unnumber(word):
        return re.sub(r'\d+', '', word[0])
    

    With that change, I get:

    ['there', 'was', 'once', 'a', 'potato_man', 'with', 'tw3nty', 'cars', 'and', 'd&', '76', 'different', 'homes']
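
    For convenience, the pieces above can be assembled into one script. This is a sketch that assumes pyparsing is installed; the variable name cleaned and the final join back into a string are additions for illustration, not part of the original answer.

```python
import re
from pyparsing import OneOrMore, Word, alphanums, oneOf

phrases = ["76", "tw3nty", "potato_man", "d&"]
text = "there was 1once a potato_man with tw3nty cars and d& 76 different99 homes"

def unnumber(tokens):
    # tokens is a ParseResults; the matched word is its first element
    return re.sub(r"\d+", "", tokens[0])

# Guard words are matched as-is; only plain alphanumeric words
# go through the digit-stripping parse action.
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(unnumber))

cleaned = parser.parseString(text).asList()
print(" ".join(cleaned))
# there was once a potato_man with tw3nty cars and d& 76 different homes
```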
    

    Also, you use the '^' operator in your parser. You may get slightly better performance with the '|' operator instead. '^' creates an Or instance, which evaluates all alternatives and chooses the longest match; that is necessary when there is some ambiguity in what the alternatives might match. '|' creates a MatchFirst instance, which stops at the first alternative that matches and does not look any further. Since your first alternative is the list of guard words, '|' is actually more appropriate: if a guard word matches, don't look any further.
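
    The two operators can be compared side by side. The sketch below assumes pyparsing is installed; the helper build and the extra input "76ers" are made up for illustration. On the sample text both variants should give the same tokens, but they can differ when a guard word is a prefix of a longer word, because oneOf does not require a word boundary by default.

```python
import re
from pyparsing import OneOrMore, Word, alphanums, oneOf

phrases = ["76", "tw3nty", "potato_man", "d&"]
text = "there was 1once a potato_man with tw3nty cars and d& 76 different99 homes"

def unnumber(tokens):
    return re.sub(r"\d+", "", tokens[0])

def build(first_match):
    # hypothetical helper: same grammar, combined with '|' or '^'
    guard = oneOf(phrases)
    word = Word(alphanums).setParseAction(unnumber)
    return OneOrMore((guard | word) if first_match else (guard ^ word))

with_or = build(False).parseString(text).asList()
with_first = build(True).parseString(text).asList()
print(with_or == with_first)  # True

# A guard word that is a prefix of a longer word is where they diverge:
prefix_or = build(False).parseString("76ers").asList()
prefix_first = build(True).parseString("76ers").asList()
print(prefix_or)     # ['ers']        longest match wins, digits stripped
print(prefix_first)  # ['76', 'ers']  guard word matched first
```

    Which behaviour is preferable depends on the data; if guard words should only match whole words, that would need to be enforced separately.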