ruby-on-rails ruby filter profanity

Profanity filter import


I am looking to write a basic profanity filter in a Rails-based application. It will use a simple search-and-replace mechanism whenever the appropriate attribute is submitted by a user. My question, for those who have written these before: is there a CSV file or a database out there from which a list of profanities can be imported into my database? We will supply the replacement words ourselves. We more or less need a database of profanities, racial slurs and anything that's not exactly rated PG-13 to trigger the filter.
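
For reference, the search-and-replace step described here might look roughly like the sketch below. It assumes a hypothetical two-column CSV (`term,replacement`) with placeholder terms; the names `load_replacements` and `naive_filter` are made up for illustration, and a single regular expression like this is only reasonable for a small list.

```ruby
require "csv"

# Hypothetical word list: one row per term, with its replacement.
# Placeholder terms stand in for real profanities.
WORD_LIST_CSV = <<~CSV
  term,replacement
  darn,d***
  heck,h***
CSV

# Build a lookup table from the CSV text, keyed by lowercase term.
def load_replacements(csv_text)
  CSV.parse(csv_text, headers: true).each_with_object({}) do |row, table|
    table[row["term"].downcase] = row["replacement"]
  end
end

# Replace whole-word, case-insensitive matches of any listed term.
def naive_filter(text, replacements)
  return text if replacements.empty?

  pattern = /\b(?:#{Regexp.union(replacements.keys).source})\b/i
  text.gsub(pattern) { |match| replacements[match.downcase] }
end

REPLACEMENTS = load_replacements(WORD_LIST_CSV)
```

In a Rails model, something like `naive_filter` could be called from a `before_validation` callback on the relevant attribute. The `\b` anchors keep "darning" from matching "darn", but they also mean any spacing or punctuation trick gets through.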


Solution

  • As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc.). CleanSpeak is capable of filtering 20,000 messages per second on a low-end server, so it is possible to build something that works well and performs well. I will mention that CleanSpeak is the result of about 3 years of ongoing development though.

    There are a few things I tell everyone who is looking to tackle a language filter.

    1. Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
    2. Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
    3. Decide what type of performance you need and whether or not you can make multiple passes on the String. The more passes you make, the slower your filter will be.
    4. Understand the Scunthorpe and clbuttic problems and determine how you will handle these. This usually requires some form of language intelligence and whitelisting.
    5. Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this).
    6. Be careful with your handling of punctuation because it can be used to get around the filter (l.i.k.e th---is)
    7. Understand how people use ASCII art and Unicode to replace characters (\/ = v; those are slashes). There are a lot of Unicode characters that look like English characters and you will want to handle those appropriately.
    8. Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
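
To make a few of these points concrete (4 through 8), here is a minimal sketch of one possible approach: normalize the text, squash out whitespace and punctuation, then look for blocked terms as substrings, with a whitelist for innocent words. This is not how CleanSpeak works. The term list, whitelist, lookalike table and the names `squash` and `flagged?` are all assumptions made up for illustration, and the lookalike table is nowhere near complete.

```ruby
# Placeholder blocked terms and innocent words that contain them.
BLOCKED_TERMS = %w[heck darn].freeze
WHITELIST = %w[check checkout].freeze

# Illustrative lookalike substitutions (an assumption, far from complete).
LOOKALIKES = {
  "\\/"    => "v", # backslash + slash drawn as a "v"
  "@"      => "a",
  "$"      => "s",
  "0"      => "o",
  "1"      => "i",
  "3"      => "e",
  "\u0430" => "a", # Cyrillic small a
  "\u0435" => "e"  # Cyrillic small ie
}.freeze

LOOKALIKE_PATTERN = Regexp.union(LOOKALIKES.keys)

# Reduce text to bare lowercase ASCII letters: fold compatibility forms
# (such as fullwidth letters), map lookalikes, then drop everything else,
# including whitespace and punctuation.
def squash(text)
  text.unicode_normalize(:nfkc)
      .downcase
      .gsub(LOOKALIKE_PATTERN, LOOKALIKES)
      .gsub(/[^a-z]/, "")
end

# True if any blocked term appears once whitelisted words are set aside
# and the rest of the text is squashed together.
def flagged?(text)
  remaining = text.split(/\s+/).reject { |word| WHITELIST.include?(squash(word)) }
  squashed = squash(remaining.join)
  BLOCKED_TERMS.any? { |term| squashed.include?(term) }
end
```

With these placeholder lists, `flagged?("h e c k")`, `flagged?("d.a.r.n it")` and `flagged?("whattheheck")` are all true, while `flagged?("please check this")` is false. The cost of ignoring whitespace is new false positives across word boundaries ("teach eckhart" squashes to something containing "heck"), which is exactly where the language intelligence mentioned in point 4 comes in.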

    You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.