
How to filter words in db.body


I am working on a program in which I want to filter out some words, in the style of nltk's stopword removal, as follows:

import re

def phrasefilter(phrase):
    # Lowercase and strip punctuation first so 'Hi!' and 'hi' are treated alike.
    phrase = re.sub(r'[^a-z0-9\s]+', '', phrase.lower())
    # Replace whole words only; str.replace('hi', ...) would also change 'this'.
    phrase = re.sub(r'\b(hi|hey)\b', 'hello', phrase)
    noise_words_set = {'of', 'the', 'at', 'for', 'in', 'and', 'is', 'from', 'are', 'our', 'it', 'its', 'was', 'when', 'how', 'what', 'like', 'whats', 'now', 'panic', 'very'}
    return ' '.join(w for w in phrase.split() if w not in noise_words_set)

Is there a way of doing this in the web2py DAL?

db.define_table( words,
    Field(words1, REQUIRES  IS_NOT_NULL(), REQUIRES....

I want to put this in the REQUIRES constraints, for example as something like IS_NOT_IN_NOISE_WORDS_SET(). Is this possible? I am working with user input (strings saved to the db), and I want the stopwords I have chosen to be removed automatically instead of having to call the snippet shown above.


Solution

  • You have several options. First, you can create a custom validator that simply acts as a filter. A validator takes a value and returns a tuple containing the (possibly transformed) value and either None or an error message. Here the second element is always None, because we are only transforming the value, not checking it for errors.

    def filter_noise_words(value):
        # Reuse the phrasefilter function from the question to strip stop words.
        filtered_value = phrasefilter(value)
        return (filtered_value, None)
    
    db.define_table('words',
                    Field('words1', requires=[filter_noise_words, IS_NOT_EMPTY()]))
    

    Note that the IS_NOT_EMPTY validator comes after the filter, so it checks the filtered input rather than the raw input.
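
    To see why the order matters, the chain can be imitated in plain Python. This is only a rough sketch and not web2py's actual code: run_validators and is_not_empty are hypothetical stand-ins for the way the requires list is processed and for IS_NOT_EMPTY, and the noise-word list is shortened.

```python
NOISE_WORDS = {'of', 'the', 'at', 'for', 'in', 'and', 'is'}  # shortened list


def filter_noise_words(value):
    # Acts as a filter: transforms the value and never reports an error.
    kept = [w for w in value.split() if w.lower() not in NOISE_WORDS]
    return (' '.join(kept), None)


def is_not_empty(value):
    # Hypothetical stand-in for IS_NOT_EMPTY(), not web2py's implementation.
    if not value.strip():
        return (value, 'Enter a value')
    return (value, None)


def run_validators(value, validators):
    # Hypothetical helper: pass the value through each validator in turn,
    # feeding the transformed value to the next one and stopping on an error.
    for validator in validators:
        value, error = validator(value)
        if error:
            return (value, error)
    return (value, None)


# Filter first: the emptiness check sees the filtered text.
print(run_validators('the cat is in the hat', [filter_noise_words, is_not_empty]))
# → ('cat hat', None)
print(run_validators('the of and', [filter_noise_words, is_not_empty]))
# → ('', 'Enter a value')

# Check first: the raw text passes, and an empty string slips through.
print(run_validators('the of and', [is_not_empty, filter_noise_words]))
# → ('', None)
```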

    Another option would be to do the filtering via the filter_in attribute of the field:

    def filter_noise_words(value):
        # Reuse the phrasefilter function from the question to strip stop words.
        filtered_value = phrasefilter(value)
        return filtered_value
    
    db.define_table('words',
                    Field('words1', requires=IS_NOT_EMPTY(), filter_in=filter_noise_words))
    

    The advantage of using filter_in is that it applies to all inserts and updates (made via the DAL), whereas a validator would only be applied when using SQLFORM (or when explicitly calling the special .validate_and_insert and .validate_and_update methods). The disadvantage of filter_in is that the filter is applied after any validators, so IS_NOT_EMPTY would run on the pre-filtered input.
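
    One possible way around that ordering, sketched here under assumptions: keep filter_in for the transformation, and pair it with a small custom validator that rejects input which would be empty once filtered. The names remove_noise_words and not_only_noise_words are made up for illustration, the noise-word list is shortened, and the validator relies only on the (value, error) contract described above. The table definition is shown as a comment because db and Field exist only inside a web2py model file.

```python
NOISE_WORDS = {'of', 'the', 'at', 'for', 'in', 'and', 'is'}  # shortened list


def remove_noise_words(value):
    # Plain filter, suitable for filter_in: returns only the transformed value.
    words = (value or '').split()
    return ' '.join(w for w in words if w.lower() not in NOISE_WORDS)


def not_only_noise_words(value):
    # Validator contract from the answer: return (value, error_or_None).
    # The value is returned unchanged; filter_in does the actual filtering.
    if not remove_noise_words(value):
        return (value, 'Enter at least one word that is not a noise word')
    return (value, None)


# In a web2py model file the two pieces would be wired together like this:
#
# db.define_table('words',
#                 Field('words1', requires=not_only_noise_words,
#                       filter_in=remove_noise_words))

print(not_only_noise_words('the of and'))
# → ('the of and', 'Enter at least one word that is not a noise word')
print(not_only_noise_words('the cat'))
# → ('the cat', None)
```

    As with any validator, this check only runs for SQLFORM submissions or the validate_and_* methods, so a direct insert could still store an empty string.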

    Finally, rather than filtering the input before storing it, you might consider storing the original input and then either storing the filtered input in a separate computed field or using a virtual field.
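
    A minimal sketch of that last idea, with assumed names: words1_filtered, filtered_words and remove_noise_words are illustrative, and the noise-word list is shortened. A computed field receives the record being written and its return value is stored alongside the original; a virtual field is evaluated when records are read and is not stored. The web2py definitions are shown as comments because db and Field exist only inside a model file.

```python
NOISE_WORDS = {'of', 'the', 'at', 'for', 'in', 'and', 'is'}  # shortened list


def remove_noise_words(value):
    # Drop noise words but leave the remaining words untouched.
    words = (value or '').split()
    return ' '.join(w for w in words if w.lower() not in NOISE_WORDS)


def filtered_words(row):
    # compute functions are called with the record; index it by field name.
    return remove_noise_words(row['words1'])


# Computed field: the original text and the filtered copy are both stored.
#
# db.define_table('words',
#                 Field('words1', requires=IS_NOT_EMPTY()),
#                 Field('words1_filtered', compute=filtered_words))
#
# Virtual field: nothing extra is stored, the value is derived on read.
#
# db.words.filtered = Field.Virtual(
#     'filtered', lambda row: remove_noise_words(row.words.words1))

# The compute function can be tried on its own with a plain dict:
print(filtered_words({'words1': 'the cat is in the hat'}))
# → cat hat
```

    Keeping the original text means the noise-word list can be changed later without losing what the user actually typed.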