I basically need some math to rank short input sentences by the following criteria:
1) Distance of the terms from the beginning of the sentence (note: relative term distance, NOT edit distance!). For example, searching for "a" should rank the sentence "a b" higher than "b a", since "a" is closer to the beginning of the sentence.
2) Distance of the terms from each other. E.g., searching for "a" AND "b" should rank "ccc a b" higher than "a ccc b", since "a" and "b" are closer to each other.
3) Order of the terms. E.g., searching for "a" AND "b" should rank "a b" higher than "b a", since that is the correct order. Nevertheless, "b a" should still be in the result set, so it must be ranked too, just with a lower weight.
4) The words themselves are UNWEIGHTED. This is the main difference from the common approaches I could easily find information on: in my case all terms have the same weight, regardless of how often they occur in the document.
I've done my research but found no match. Do you know of a ranking algorithm that matches this, or at least comes close?
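using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Lower is better; decimal.MaxValue means a term is missing from the subject.
// Assumes terms contains at least one entry.
decimal Rank(string subject, IList<string> terms)
{
    // Isolate all the words in the subject.
    var words = Regex.Matches(subject, @"\w+")
        .Cast<Match>()
        .Select(m => m.Value.ToLower())
        .ToList();

    // Find the position (word index) of each term in the subject.
    var positions = new List<int>();
    var sumPositions = 0;
    foreach (var term in terms)
    {
        int pos = words.IndexOf(term.ToLower());
        if (pos < 0) return decimal.MaxValue;
        positions.Add(pos);
        sumPositions += pos;
    }

    // Penalize the difference between the average positions:
    // how far the terms sit from the beginning of the subject.
    decimal averageSubject = (decimal) sumPositions / terms.Count;
    decimal averageTerms = (terms.Count - 1) / 2m; // average(0..n-1)
    decimal rank = Math.Abs(averageSubject - averageTerms);

    // Penalize each term's offset from its expected place around the average:
    // this covers both gaps between terms and terms in the wrong order.
    for (int i = 0; i < terms.Count; i++)
    {
        decimal relativePos1 = positions[i] - averageSubject;
        decimal relativePos2 = i - averageTerms;
        rank += Math.Abs(relativePos2 - relativePos1);
    }
    return rank;
}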
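Read as a formula, the function appears to compute the following. The symbols are my own notation, not names from the code: n is the number of terms, p_i is the 0-based word index of term i in the subject, p-bar is the average of those indices (averageSubject) and t-bar is the average of 0..n-1 (averageTerms).

```latex
\[
\bar{p} = \frac{1}{n}\sum_{i=0}^{n-1} p_i, \qquad
\bar{t} = \frac{n-1}{2}, \qquad
\text{rank} = \lvert \bar{p} - \bar{t} \rvert
  + \sum_{i=0}^{n-1} \bigl\lvert (i - \bar{t}) - (p_i - \bar{p}) \bigr\rvert
\]
% Worked example: subject "a ccc b", terms "a", "b"
\[
p = (0, 2), \quad \bar{p} = 1, \quad \bar{t} = 0.5, \quad
\text{rank} = 0.5 + \lvert -0.5 - (-1) \rvert + \lvert 0.5 - 1 \rvert = 1.5
\]
```

The first summand grows as the terms move away from the beginning of the sentence; the sum grows when the terms are spread apart or appear in the wrong order.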
Lower values mean better matches: it is easier to measure the distance from a perfect match than to assign a score to each match.
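Since a lower value is better and decimal.MaxValue marks a sentence that lacks one of the terms, a search could drop those sentences and sort the rest in ascending order. The following is only a sketch of that idea: the class name RankDemo and the helper Search are hypothetical names of mine, Rank is repeated in compact form so the block compiles on its own, and at least one search term is assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RankDemo
{
    // Same scoring as Rank: lower is better, decimal.MaxValue = term missing.
    // Assumes terms contains at least one entry.
    public static decimal Rank(string subject, IList<string> terms)
    {
        var words = Regex.Matches(subject, @"\w+")
            .Cast<Match>()
            .Select(m => m.Value.ToLower())
            .ToList();

        var positions = new List<int>();
        foreach (var term in terms)
        {
            int pos = words.IndexOf(term.ToLower());
            if (pos < 0) return decimal.MaxValue;
            positions.Add(pos);
        }

        decimal averageSubject = (decimal)positions.Sum() / terms.Count;
        decimal averageTerms = (terms.Count - 1) / 2m;
        decimal rank = Math.Abs(averageSubject - averageTerms);
        for (int i = 0; i < terms.Count; i++)
            rank += Math.Abs((i - averageTerms) - (positions[i] - averageSubject));
        return rank;
    }

    // Hypothetical helper: keep only sentences containing every term,
    // ordered from best (lowest rank) to worst.
    public static List<string> Search(IEnumerable<string> sentences, IList<string> terms)
    {
        return sentences
            .Select(s => new { Sentence = s, Score = Rank(s, terms) })
            .Where(x => x.Score != decimal.MaxValue)
            .OrderBy(x => x.Score)
            .Select(x => x.Sentence)
            .ToList();
    }
}
```

For instance, RankDemo.Search(new[] { "b a", "a ccc b", "x y", "ccc a b", "a b" }, new[] { "a", "b" }) returns "a b", "ccc a b", "a ccc b", "b a" in that order; "x y" is dropped because it contains neither term.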
Example

Subject      Terms       Rank
"a b"        "a"         0.0
"b a"        "a"         1.0
"ccc a b"    "a", "b"    1.0
"a ccc b"    "a", "b"    1.5
"a b"        "a", "b"    0.0
"b a"        "a", "b"    2.0