algorithm math discrete-mathematics scoring

Scoring System Suggestion - weighted mechanism?

I'm trying to validate a series of words that are provided by users. I'm trying to come up with a scoring system that will determine the likelihood that the series of words are indeed valid words.

Assume the following input:

xxx yyy zzz

The first thing I do is check each word individually against a database of words that I have. So, let's say that xxx was in the database, so we are 100% sure it's a valid word. Then let's say that yyy doesn't exist in the database, but a possible variation of its spelling exist (say yyyy). We don't give yyy a score of 100%, but maybe something lower (let's say 90%). Then zzz just doesn't exist at all in the database. So, zzz gets a score of 0%.

So we have something like this:

xxx = 100%
yyy = 90%
zzz = 0%

Assume further that the users are either going to either:

Provide a list of all valid words (most likely)
Provide a list of all invalid words (likely)
Provide a list of a mix of valid and invalid words (not likely)

As a whole, what is a good scoring system to determine a confidence score that xxx yyy zzz is a series of valid words? I'm not looking for anything too complex, but getting the average of the scores doesn't seem right. If some words in the list of words are valid, I think it increases the likelihood that the word not found in the database is an actual word also (it's just a limitation of the database that it doesn't contain that particular word).

NOTE: The input will generally be a minimum of 2 words (and mostly 2 words), but can be 3, 4, 5 (and maybe even more in some rare cases).

Solution

EDIT I have added a new section looking at discriminating word groups into English and non-English groups. This is below the section on estimating whether any given word is English.

I think you intuit that the scoring system you've explained here doesn't quite do justice to this problem.

It's great to find words that are in the dictionary - those words can be immediately give 100% and passed over, but what about non-matching words? How can you determine their probability? This can be explained by a simple comparison between sentences comprising exactly the same letters:

Abergrandly recieved wuzkinds
Erbdnerye wcgluszaaindid vker

Neither sentence has any English words, but the first sentence looks English - it might be about someone (Abergrandly) who received (there was a spelling mistake) several items (wuzkinds). The second sentence is clearly just my infant hitting the keyboard.

So, in the example above, even though there is no English word present, the probability it's spoken by an English speaker is high. The second sentence has a 0% probability of being English.

I know a couple of heuristics to help detect the difference:

Simple frequency analysis of letters

A typical distribution of letters in English. From Wikipedia

In any language, some letters are more common than others. Simply counting the incidence of each letter and comparing it to the languages average tells us a lot.

There are several ways you could calculate a probability from it. One might be:

Preparation
1. Compute or obtain the frequencies of letters in a suitable English corpus. The NLTK is an excellent way to begin. The associated Natural Language Processing with Python book is very informative.
The Test
1. Count the number of occurrences of each letter in the phrase to test
2. Compute the Linear regression where the co-ordinate of each letter-point is:
  1. X axis: Its predicted frequency from 1.1 above
  2. Y axis: The actual count
3. Perform a Regression Analysis on the data
  1. English should report a positive r close to 1.0. Compute the R^2 as a probability that this is English.
  2. An r of 0 or below is either no correlation to English, or the letters have a negative correlation. Not likely English.

Advantages:

Very simple to calculate

Disadvantages:

Will not work so well for small samples, eg "zebra, xylophone"
"Rrressseee" would seem a highly probably word
Does not discriminate between the two example sentences I gave above.

Bigram frequencies and Trigram frequencies

This is an extension of letter frequencies, but looks at the frequency of letter pairs or triplets. For example, a u follows a q with 99% frequency (why not 100%? dafuq). Again, the NLTK corpus is incredibly useful.

Digraph frequency based on a sample to 40,000 words

Above from: http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/digraphs.jpg

This approach is widely used across the industry, in everything from speech recognition to predictive text on your soft keyboard.

Trigraphs are especially useful. Consider that 'll' is a very common digraph. The string 'lllllllll' therefore consists of only common digraphs and the digraph approach makes it look like a word. Trigraphs resolve this because 'lll' never occurs.

The calculation of this probability of a word using trigraphs can't be done with a simple linear regression model (the vast majority of trigrams will not be present in the word and so the majority of points will be on the x axis). Instead you can use Markov chains (using a probability matrix of either bigrams or trigrams) to compute the probability of a word. An introduction to Markov chains is here.

First build a matrix of probabilities:

X axis: Every bigram ("th", "he", "in", "er", "an", etc)
Y axis: The letters of the alphabet.
The matrix members consist of the probability of the letter of the alphabet following the bigraph.

To start computing probabilities from the start of the word, the X axis digraphs need to include spaces-a, space-b up to space-z - eg the digraph "space" t represents a word starting t.

Computing the probability of the word consists of iterating over digraphs and obtaining the probability of the third letter given the digraph. For example, the word "they" is broken down into the following probabilities:

h following "space" t -> probability x%
e following th -> probability y%
y following he -> probability z%

Overall probability = x * y * z %

This computation solves the issues for a simple frequency analysis by highlighting the "wcgl" as having a 0% probability.

Note that the probability of any given word will be very small and becomes statistically smaller by between 10x to 20x per extra character. However, examining the probability of known English words of 3, 4, 5, 6, etc characters from a large corpus, you can determine a cutoff below which the word is highly unlikely. Each highly unlikely trigraph will drop the likelihood of being English by 1 to 2 orders of magnitude.

You might then normalize the probability of a word, for example, for 8-letter English words (I've made up the numbers below):

Probabilities from Markov chain:
- Probability of the best English word = 10^-7 (10% * 10% * .. * 10%)
- Cutoff (Probability of least likely English word) = 10^-14 (1% * 1% * .. * 1%)
- Probability for test word (say "coattail") = 10^-12
'Normalize' results
- Take logs: Best = -7; Test = -12; Cutoff = -14
- Make positive: Best = 7; Test = 2; Cutoff = 0
- Normalize between 1.0 and 0.0: Best = 1.0; Test = 0.28; Cutoff = 0.0
- (You can easily adjust the higher and lower bounds to, say, between 90% and 10%)

Now we've examined how to get a better probability that any given word is English, let's look at the group of words.

The group's definition is that it's a minimum of 2 words, but can be 3, 4, 5 or (in a small number of cases) more. You don't mention that there is any overriding structure or associations between the words, so I am not assuming:

That any group is a phrase, eg "tank commander", "red letter day"
That the group is a sentence or clause, eg " I am thirsty", "Mary needs an email"

However if this assumption is wrong, then the problem becomes more tractable for larger word-groups because the words will conform to English's rules of syntax - we can use, say, the NLTK to parse the clause to gain more insight.

Looking at the probability that a group of words is English

OK, in order to get a feel of the problem, let's look at different use cases. In the following:

I am going to ignore the cases of all words or all not-words as those cases are trivial
I will consider English-like words that you can't be assumed to be in a dictionary, like weird surnames (eg Kardashian), unusual product names (eg stackexchange) and so on.
I will use simple averages of the probabilities assuming that random gibberish is 0% while English-like words are at 90%.

Two words

(50%) Red ajkhsdjas
(50%) Hkdfs Friday
(95%) Kardashians program
(95%) Using Stackexchange

From these examples, I think you would agree that 1. and 2. are likely not acceptable whereas 3. and 4. are. The simple average calculation appears a useful discriminator for two word groups.

Three words

With one suspect words:

(67%) Red dawn dskfa
(67%) Hskdkc communist manifesto
(67%) Economic jasdfh crisis
(97%) Kardashian fifteen minutes
(97%) stackexchange user experience

Clearly 4. and 5. are acceptable.

But what about 1., 2. or 3.? Are there any material differences between 1., 2. or 3.? Probably not, ruling out using Baysian statistics. But should these be classified as English or not? I think that's your call.

With two suspect words:

(33%) Red ksadjak adsfhd
(33%) jkdsfk dsajjds manifesto
(93%) Stackexchange emails Kardashians
(93%) Stackexchange Kardashian account

I would hazard that 1. and 2. are not acceptable, but 3. and 4 definitely are. (Well, except the Kardashians' having an account here - that does not bode well). Again the simple averages can be used as a simple discriminator - and you can choose if it's above or below 67%.

Four words

The number of permutations starts getting wild, so I'll give only a few examples:

One suspect word:
1. (75%) Programming jhjasd language today
2. (93%) Hopeless Kardashian tv series
Two suspect words:
1. (50%) Programming kasdhjk jhsaer today
2. (95%) Stackexchange implementing Kasdashian filter
Three suspect words:
1. (25%) Programming sdajf jkkdsf kuuerc
2. (93%) Stackexchange bitifying Kardashians tweetdeck

In my mind, it's clear which word groups are meaningful aligns with the simple average with the exception of 2.1 - that's again your call.

Interestingly the cutoff point for four word groups might be different from three-word groups, so I'd recommend that your implementation has different a configuration setting for each group. Having different cutoffs is a consequence that the quantum jump from 2->3 and then 3->4 does not mesh with the idea of smooth, continuous probabilities.

Implementing different cutoff values for these groups directly addresses your intuition "Right now, I just have a "gut" feeling that my xxx yyy zzz example really should be higher than 66.66%, but I'm not sure how to express it as a formula.".

Five words

You get the idea - I'm not going to enumerate any more here. However, as you get to five words, it starts to get enough structure that several new heuristics can come in:

Use of Bayesian probabilities/statistics (what is the probability of the third word being a word given that the first two were?)
Parsing the group using the NLTK and looking at whether it makes grammatical sense

Problem cases

English has a number of very short words, and this might cause a problem. For example:

Gibberish: r xu r
Is this English? I am a

You may have to write code to specifically test for 1 and 2 letter words.

TL;DR Summary

Non-dictionary words can be tested for how 'English' (or French or Spanish, etc) they are using letter and trigram frequencies. Picking up English-like words and attributing them a high score is critical to distinguish English groups
Up to four words, a simple average has great discriminatory power, but you probably want to set a different cutoff for 2 words, 3 words and 4 words.
Five words and above you can probably start using Bayesian statistics
Longer word groups if they should be sentences or sentence fragments can be tested using a natural language tool, such as NLTK.
This is a heuristic process and, ultimately, there will be confounding values (such as "I am a"). Writing an perfect statistical analysis routine may therefore not be especially useful compared to a simple average if it can be confounded by a large number of exceptions.