string, linguistics, nlp

How do I determine if a random string sounds like English?


I have an algorithm that generates strings from a list of input words. How do I keep only the strings that sound like English words? For example, discard RDLO while keeping LORD.

EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example KEAL would be accepted.


Solution

  • You can build a Markov chain from a large body of English text.

    Afterwards you can feed a word into the Markov chain and compute how probable it is under the model. The higher the probability, the more the word looks like English.

    See here: http://en.wikipedia.org/wiki/Markov_chain

    At the bottom of that page you can see the Markov text generator. What you want is exactly the reverse: instead of generating text from the chain, you score existing strings against it.

    In a nutshell: for each character, the Markov chain stores the probabilities of which character follows it. You can extend this idea to contexts of two or three characters if you have enough memory.
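
A minimal sketch of the idea above, in Python. The names `train`, `score` and `SAMPLE_WORDS` are made up for illustration, and the tiny word list is only a stand-in for the large English text the answer recommends. The add-one smoothing and the per-transition averaging are assumptions of this sketch, not part of the original answer; they keep unseen character pairs from producing a probability of zero and make words of different lengths roughly comparable.

```python
import math
from collections import defaultdict

# Stand-in corpus (assumption): replace with words from a large English text.
SAMPLE_WORDS = [
    "lord", "word", "order", "old", "load", "road", "real", "deal", "keel",
    "lead", "lore", "cord", "ford", "bold", "gold", "door", "red", "rod",
    "seal", "meal",
]

def train(words, order=1):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        # "^" marks the start of a word and "$" marks its end.
        padded = "^" * order + word.lower() + "$"
        for i in range(order, len(padded)):
            counts[padded[i - order:i]][padded[i]] += 1
    return counts

def score(word, counts, order=1, alpha=1.0, alphabet_size=27):
    """Average log-probability per transition; higher means more English-like.

    `alpha` is the smoothing constant and `alphabet_size` covers the 26
    letters plus the end marker. `order` must match the one used in train().
    """
    padded = "^" * order + word.lower() + "$"
    total = 0.0
    for i in range(order, len(padded)):
        following = counts.get(padded[i - order:i], {})
        seen = following.get(padded[i], 0)
        context_total = sum(following.values())
        total += math.log((seen + alpha) / (context_total + alpha * alphabet_size))
    return total / (len(padded) - order)

if __name__ == "__main__":
    model = train(SAMPLE_WORDS)
    for candidate in ("LORD", "KEAL", "RDLO"):
        print(candidate, round(score(candidate, model), 3))
```

To accept or reject strings you would pick a score threshold, for example by scoring a set of known English words and a set of random strings and choosing a cutoff between the two groups. Passing `order=2` or `order=3` to both functions gives the longer contexts mentioned above, at the cost of needing much more training text.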