Tags: c#, search, filter, lucene, lucene.net

Lucene.NET 2.9 Custom Filter to add authorisation


Friends,

I'm new to Lucene...
I successfully created an index and added fields, and searching works.

Now, my database has a view that tells which users can see which documents. This view is built from several complicated rules, so I want to reuse it. I therefore need to add a filter to the Lucene search that removes documents that match the query but that the user doesn't have access to.
What I have tried so far:
- Store the db document id in a field. It's a Guid; I store it as a string.
- Create a custom filter that fetches all the document ids the current user can access, then filters on that field in Lucene.

I have the feeling that this will not be efficient... A user can have access to hundreds of thousands of documents, so I may retrieve 200 000 document ids that I need to filter on. I suppose I have to cache some of this...
Here is the code I've written, but it doesn't work: no documents are returned when the filter is used (it should return 3 docs)

public class LuceneAuthorisationFilter : Filter
{
    public override DocIdSet GetDocIdSet(Lucene.Net.Index.IndexReader reader)
    {
        List<Guid> ids = this.load(); // Load the list of accessible IDs from the database
        OpenBitSet result = new OpenBitSet(reader.MaxDoc);

        // One-element buffers: each EmId is expected to match at most one document
        int[] docs = new int[1];
        int[] freq = new int[1];

        foreach (Guid id in ids)
        {
            Lucene.Net.Index.TermDocs termDocs = reader.TermDocs(new Lucene.Net.Index.Term("EmId", id.ToString()));
            try
            {
                // Read returns the number of postings copied into the buffers
                if (termDocs.Read(docs, freq) == 1)
                {
                    result.FastSet(docs[0]);
                }
            }
            finally
            {
                termDocs.Close(); // release the postings enumerator for this term
            }
        }
        return result;
    }
}

Do you have any idea what's wrong? And how can I improve performance?

Thank you

EDIT:
The code above works; the only problem was that the EmId field was not indexed. I've changed that and it works now.
Now I would like any tips to improve performance.
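
For reference, a minimal sketch of indexing the id so that exact term lookups on "EmId" can match. It assumes the Lucene.NET 2.9 Field API; the DocumentBuilder class and the "Body" field are hypothetical names used only for illustration.

```csharp
using System;
using Lucene.Net.Documents;

// Hypothetical helper, not part of the original code.
public static class DocumentBuilder
{
    public static Document Build(Guid emId, string body)
    {
        Document doc = new Document();

        // NOT_ANALYZED indexes the Guid string as one exact term,
        // so a lookup with new Term("EmId", emId.ToString()) can find it.
        doc.Add(new Field("EmId", emId.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));

        // "Body" is an assumed full-text field; it is analyzed as usual.
        doc.Add(new Field("Body", body, Field.Store.NO, Field.Index.ANALYZED));

        return doc;
    }
}
```

A field added with Field.Index.NO is stored but has no terms in the index, which would explain a filter that never matches anything.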


2ND EDIT TO ADD FEEDBACK

Note: The test environment contains 25 000 documents, and the list of document access contains 50 000 ids (because not all documents are indexed yet)

  • Using the custom filter above: ~2600ms the first time, 2100ms on subsequent runs, as the filter is cached
  • Using a Boolean query filter: ~4700ms, then ~4000ms

This is poor performance... So I searched again and found the 'FieldCacheTermsFilter' filter.

  • Using a FieldCacheTermsFilter: ~600ms then ~60ms

This is acceptable performance
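
A minimal sketch of how the FieldCacheTermsFilter variant might look, assuming Lucene.NET 2.9 and that "EmId" is indexed as a single un-analyzed term per document. The AuthorisedSearch class and its method names are hypothetical, and the list of allowed ids is assumed to come from the database view.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Lucene.Net.Search;

// Hypothetical helper, not part of the original code.
public static class AuthorisedSearch
{
    // Builds a filter that keeps only documents whose "EmId" value
    // is one of the ids the user is allowed to see.
    public static Filter BuildFilter(IEnumerable<Guid> allowedIds)
    {
        string[] terms = allowedIds.Select(id => id.ToString()).ToArray();
        return new FieldCacheTermsFilter("EmId", terms);
    }

    // Runs the query restricted to the allowed documents.
    public static TopDocs Search(Searcher searcher, Query query, IEnumerable<Guid> allowedIds, int maxHits)
    {
        return searcher.Search(query, BuildFilter(allowedIds), maxHits);
    }
}
```

This filter relies on the field cache for "EmId", which is loaded once per index reader. That would be consistent with the slower first run and the much faster later runs measured above.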

PS: I also found another similar question


Solution

  • Talking about performance is always tricky when no numbers/measurements are given.

    That being said, what have you benchmarked in terms of performance? What are your bottlenecks (IO/CPU/etc.), and have you compared this against other methods?

    Do you actually need to improve performance? Discussions about performance improvements are not about "feelings"; they are about hard facts based on evidence and a demonstrated need to improve.

    Now for your Filter: unless there's something I didn't get from the question, I don't see why you cannot use what is already built into Lucene to do the hard work.

    Here is how I usually handle permission stuff in Lucene; it has always worked well with indexes containing billions of documents. I usually use LRU-type caches with a minimum age before an item can be purged from the cache.

    For example: cache 100 items, but cache more if the least recently used item is not more than 15 minutes old.

    If you try something like this, it could be interesting if you compare it to your method and come back to post some performance numbers.

    Disclaimer: code written directly in the textarea of SO, take it more as pseudo-code than an already working copy paste solution:

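    using System;
    using System.Collections.Generic;
    using Lucene.Net.Index;
    using Lucene.Net.Search;

    // todo: probably some thread safety
    public class AccessFilterFactory
    {
        private static AccessFilterFactory _instance = new AccessFilterFactory();

        private AccessFilterFactory()
        {
        }

        // singleton accessor (must be static to be reachable without an instance)
        public static AccessFilterFactory Instance
        {
            get
            {
                return _instance;
            }
        }

        // Cache<,> is a placeholder for whatever LRU cache you use
        private Cache<int, Filter> someKindaCache = new Cache<int, Filter>();

        // gets a cached filter if already built; if not, it creates one,
        // caches it and returns it
        public Filter GetFilterForUser(int userId)
        {
            // return from cache if you got it
            if (someKindaCache.Exists(userId))
                return someKindaCache.Get(userId);

            // if not, build and cache it
            // note: BooleanQuery caps the clause count (1024 by default),
            // so the limit has to be raised for large id lists
            BooleanQuery filterQuery = new BooleanQuery();
            foreach (string id in LoadAccessibleIds(userId))
            {
                filterQuery.Add(new TermQuery(new Term("EmId", id)), BooleanClause.Occur.SHOULD);
            }
            Filter cachingFilter = new CachingWrapperFilter(new QueryWrapperFilter(filterQuery));
            someKindaCache.Put(userId, cachingFilter);
            return cachingFilter;
        }

        // hypothetical helper: returns the EmId values the user may see
        // (e.g. read from the database view); left for you to implement
        private IEnumerable<string> LoadAccessibleIds(int userId)
        {
            throw new NotImplementedException();
        }

        // removes a no-longer-valid filter from the cache (permissions changed)
        public void InvalidateFilter(int userId)
        {
            someKindaCache.Remove(userId);
        }
    }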
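
The Cache<int, Filter> type above is only a placeholder. A minimal sketch of an LRU cache with a minimum age, exposing the same Exists/Get/Put/Remove calls, could look like the following. Everything here is an assumption made for illustration: the defaults (100 items, 15 minutes) mirror the example figures above, and the injectable clock exists only to make the behaviour testable.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical LRU cache: evicts the least recently used entry once the
// capacity is exceeded, but only if that entry has been idle for at least minAge.
public class Cache<TKey, TValue>
{
    private sealed class Entry
    {
        public TKey Key;
        public TValue Value;
        public DateTime LastUsed;
    }

    private readonly int _capacity;
    private readonly TimeSpan _minAge;
    private readonly Func<DateTime> _now;
    private readonly object _sync = new object();
    private readonly Dictionary<TKey, LinkedListNode<Entry>> _map =
        new Dictionary<TKey, LinkedListNode<Entry>>();
    private readonly LinkedList<Entry> _lru = new LinkedList<Entry>(); // most recently used first

    public Cache(int capacity = 100, TimeSpan? minAge = null, Func<DateTime> now = null)
    {
        _capacity = capacity;
        _minAge = minAge ?? TimeSpan.FromMinutes(15);
        _now = now ?? (() => DateTime.UtcNow);
    }

    public int Count { get { lock (_sync) { return _map.Count; } } }

    public bool Exists(TKey key) { lock (_sync) { return _map.ContainsKey(key); } }

    public TValue Get(TKey key)
    {
        lock (_sync)
        {
            LinkedListNode<Entry> node = _map[key]; // throws KeyNotFoundException if absent
            node.Value.LastUsed = _now();
            _lru.Remove(node);
            _lru.AddFirst(node); // mark as most recently used
            return node.Value.Value;
        }
    }

    public void Put(TKey key, TValue value)
    {
        lock (_sync)
        {
            Remove(key); // replace any existing entry (the lock is re-entrant)
            Entry entry = new Entry { Key = key, Value = value, LastUsed = _now() };
            _map[key] = _lru.AddFirst(entry);

            // Over capacity: drop old entries from the tail, keep young ones.
            while (_map.Count > _capacity && _now() - _lru.Last.Value.LastUsed >= _minAge)
            {
                _map.Remove(_lru.Last.Value.Key);
                _lru.RemoveLast();
            }
        }
    }

    public void Remove(TKey key)
    {
        lock (_sync)
        {
            LinkedListNode<Entry> node;
            if (_map.TryGetValue(key, out node))
            {
                _lru.Remove(node);
                _map.Remove(key);
            }
        }
    }
}
```

Each call is locked individually, so an Exists check followed by Get can still race with a concurrent Remove; a TryGet-style method would be one way to close that gap.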