Search code examples
c#lucenefull-text-searchsearch-enginelucene.net

Pull only AND Sections with a selected "Phrase" from a larger Boolean Search String


This is much more theory than coding, so forgive me.

User Scenario:

A user runs a search against our Lucene implementation. We return this data and tell them what words yielded the highest returns. The user notices "better" is significantly higher than its peers and wants to know why. We already provide the straight "Go look up occurences of Better" option but what the user really wants to see is "Better" without known false positives. False positives are removed with the AND's in the boolean search string that directly affect "Better". The easy scenario shows that best below. When the user chooses "Better" as their concern word then we have to make sure we don't show them false positives. This would likely be in the below scenario simplified to:

("better") AND NOT ("not great" OR "not good" OR 
"not better" OR "not best" OR "great pain" OR "great pains")

Complexity of the term does not limit that the false positive phrases will contain the word.

Some Supposes:

Suppose even if you tell me something as abstract as using reverse polish notation I will get what you mean.

Suppose we already know how to find "Better" in the main string or tree.

Suppose complexity can be any level sort of boolean searching such as:

Easy:

("Great" OR "Good" OR "Better" OR "Best") AND NOT ("not great" OR "not good" OR 
"not better" OR "not best" OR "great pain" OR "great pains")

enter image description here

More Complex:

    ("Great" OR ("Good" AND NOT "isn't that good") OR "Better" OR 
    ("Best" AND NOT "not the best") ) AND NOT ("not great" OR "not good" OR 
    "not better" OR "not best" OR "great pain" OR "great pains")

enter image description here

Even More Complex (With the same desired result):

    (("Great" OR ("Good" AND NOT "isn't that good") OR "Better" OR
    (("Terrific" OR "fun" AND NOT ("effing terrific" OR "effing fun")) ) OR 
    ("Best" AND NOT "not the best") ) OR "Fabulous" ) AND NOT ("not great" OR "not good" OR 
    "not better" OR "not best" OR "great pain" OR "great pains")

Here is a slight picture of one way to tree the most complicated one.  There are the words below the AND's I just hid them for readability.

Note: You should assume that these can be written very, very poorly. Don't focus too much on making these efficient these are user input and we will not edit them beforehand.

Now assume we could use a binary tree to crack this nut but I think it would take a very long time to explain to someone who can not just comprehend a binary tree on this, but it would look something like where "Great" or "not best", etc. would be a TERM_NODEs and operators - AND, NOT, OR would be OPERATOR_NODEs (Assume we align the tree by order of operation correctly.)

The Money Question:

Now assume we would like a deeper insight on "Better" excluding the other pieces of this term.

In a textual or binary tree how do I now exclude all other terms but maintain the AND NOT removals throughout my tree?

We basically receive input from the user that they would like to exclude all OR relationships to "Better", but maintain the AND NOTS that already apply to "Better."

Desired Result: The desired result for all cases would be:

("Better") AND NOT ("not great" OR "not good" OR 
"not better" OR "not best" OR "great pain" OR "great pains")

enter image description here

So we could then take this and run against SQL or Lucene for instance and get that result set back.


Solution

  • I eventually understood and then solved this with our existing tree implementation. It is kind of brute but effective as far as I can tell.

        private void button1_Click(object sender, EventArgs e)
        {
            //Step 1 parse your term to a tree
            var pt = ParseTerm.Parse(textBox1.Text, null);
            //Step 2 find the term we care about
            pt = FindTermNode(pt, "\"better\"");
            //Step 3 Walk the tree backwards and make the term
            var res = CreateANDTerm(pt);
            textBox2.Text = res;
        }
    
        enum direction
        { left, right}
    
        private string CreateANDTerm(ParseTerm inPt)
        {  
            direction fromDirection;
            var originPt = inPt;
            StringBuilder retStr = new StringBuilder();
            retStr.Append("("+originPt.myexpression+")");
            var currentPt = inPt;
            var nextPt = currentPt.parent;
    
            do
            {
                if (nextPt.left == currentPt)
                {
                    fromDirection = direction.left;
                }
                else
                {
                    fromDirection = direction.right;
                }
    
                if (nextPt.relation == ParseTerm.RelationType.RT_AND)
                {
                    if (fromDirection == direction.left)
                    {
                        retStr.Append(" AND (" + nextPt.right.myexpression + ")");
                    }
                    else
                    {
                        retStr.Append(" AND (" + nextPt.left.myexpression + ")");
                    }
                }
                currentPt = nextPt;
                nextPt = currentPt.parent;
            } while (nextPt != null);
            return retStr.ToString();
        }
    
        private ParseTerm FindTermNode(ParseTerm pt,string inStr)
        {
            if (pt.relation == ParseTerm.RelationType.RT_TERM && pt.searchstring == inStr)
            {
                return pt;
            }
            else
            {
                if(pt.left != null && pt.left.myexpression.Contains(inStr))
                {
                    var retPT = FindTermNode(pt.left, inStr);
                    if (retPT != null)
                        return retPT;                  
                }
                if (pt.right != null && pt.right.myexpression.Contains(inStr))
                {
                    var retPT = FindTermNode(pt.right, inStr);
                    if (retPT != null)
                        return retPT;                   
                }
                return null;
            }
        }