Tags: checksum, human-readable, fingerprinting

Where to find a list of innocuous words for human-readable checksums and fingerprints


I have several applications that create a unique (with high probability), human-readable checksum or fingerprint by applying a cryptographic hash like MD5, then using the resulting bits with an arithmetic coder to select words from a list. I've simply been using /usr/share/dict/words, but recently a client (rightly) complained about receiving a document whose checksum included offensive words or trigger words. (More details are in my answer to "Generate User Friendly Codes".)
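
For concreteness, here is a minimal sketch of that kind of scheme. It is not my production code, and it makes some simplifying assumptions: it uses SHA-256 rather than MD5, and instead of a full arithmetic coder it treats the digest as one large integer and peels off word indices with repeated division. The function name `words_checksum`, the `count` parameter, and the hyphen separator are all made up for illustration.

```python
import hashlib


def words_checksum(data: bytes, words: list[str], count: int = 4) -> str:
    """Map a hash of `data` to `count` words drawn from `words`.

    Simplified stand-in for an arithmetic coder: the digest is read as a
    big integer and split into indices by repeated division.
    """
    if not words:
        raise ValueError("word list must not be empty")
    # SHA-256 is an assumption here; any cryptographic hash would do.
    digest = hashlib.sha256(data).digest()
    n = int.from_bytes(digest, "big")
    chosen = []
    for _ in range(count):
        # The remainder selects one word; the quotient carries the
        # remaining bits forward to the next selection.
        n, index = divmod(n, len(words))
        chosen.append(words[index])
    return "-".join(chosen)
```

The same input and word list always give the same words, so the list has to stay fixed (same words, same order) for old checksums to remain verifiable.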

For this application, long lists are important because they make repeated words less likely; the list I'm using has many tens of thousands of words.

Does anyone know either how to remove offensive and trigger words from such a list, or where to find a list of innocuous words?


Solution

  • One possibility is the ENABLE word list, used by Words with Friends and some other games. They try to avoid controversial words (pick your favorites and you won't find them there!-) It is in the public domain, so you can find it here and there. It's roughly 172,000 words. Here is one place I found it: http://www.greenworm.net/sites/default/files/gw-assets/enable1-wwf-v4.0-wordlist.txt

    Also, Scrabble has divergent lists: the company which owns the game has the "filtered" list, while the clubs use the unfiltered lists for competition. I don't want to post a link to offensive material, but if you Google "seattle scrabble club expurgated words", you might find a list of the words removed from the naughty list to produce the nice list. If every word you got complaints about appears on that list, you could just use it as a filter.
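
If you go the filter route, the step itself is small. Here is a rough sketch, assuming both lists are plain text with one word per line. The function name `filter_words` and the file name `blocked.txt` are hypothetical, and lowercasing and de-duplicating are my own choices rather than anything the lists require.

```python
def filter_words(words, blocked):
    """Return the words not in `blocked`, lowercased, in original order.

    Both arguments are iterables of strings (for example open text
    files); surrounding whitespace and blank lines are ignored.
    """
    blocked_set = {w.strip().lower() for w in blocked if w.strip()}
    seen = set()
    kept = []
    for raw in words:
        word = raw.strip().lower()
        # Skip blanks, blocked words, and duplicates created by lowercasing.
        if word and word not in blocked_set and word not in seen:
            seen.add(word)
            kept.append(word)
    return kept


# Example usage (blocked.txt is a placeholder for whatever list you find):
# with open("/usr/share/dict/words") as src, open("blocked.txt") as bad:
#     clean = filter_words(src, bad)
```

Keep in mind that changing the word list changes every checksum generated from it, so you would want to do the filtering once and then freeze the result.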