
How can I look for objects with missing value or None as key?


I would like to search the Zope catalog for objects that have no value for a given index key. Is that possible?

For example, consider the following code:

from Products.CMFCore.utils import getToolByName
catalog = getToolByName(context, 'portal_catalog')
results = catalog.searchResults({'portal_type': 'Event', 'review_state': 'pending'})

What should I do if I'm interested in objects for which a certain item (rather than portal_type or review_state) has not been set?


Solution

  • You can search for both, but searching for MissingValue entries requires custom handling of the internal catalog data structures.

    Indexes take a value from an object and index it. If that raises an AttributeError or similar, the index stores nothing for that object. If the same field is also one of the returned columns, that column holds a MissingValue to indicate that there is no value for that field.

    In the following examples I assume you have a variable catalog that points to the site's portal_catalog tool; e.g. the result of getToolByName(context, 'portal_catalog') or similar.

    Searching for None

    You can search for None in many indexes just fine:

    catalog(myKeywordIndex=None)
    

    The problem is that most index types ignore None as a value. Searching for None therefore fails on Date and Path indexes, which ignore None when indexing, and on Boolean indexes, which turn None into False when indexing.

    Keyword indexes ignore None as well, unless it is part of a sequence. If the indexed method returns [None], it is happily indexed, but None on its own is not.

    Field indexes do store None in the index.

    Note that each index can show unique values, so you can check if there are None values stored for a given index by calling:

    catalog.uniqueValuesFor(indexname)
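
    As a rough sketch of that check, the helper below tests whether None appears among an index's unique values. The name index_stores_none is made up for illustration; the only thing it assumes about the catalog is the uniqueValuesFor method shown above.

```python
def index_stores_none(catalog, index_name):
    """Return True if None is among the values stored in the named index.

    index_stores_none is a hypothetical helper name; catalog is assumed
    to provide uniqueValuesFor(), as portal_catalog does.
    """
    # uniqueValuesFor returns the distinct values held by the index.
    return None in tuple(catalog.uniqueValuesFor(index_name))
```

    If this returns False for an index, a query for None against that index cannot match anything.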
    

    Searching for missing values

    This is a little trickier. Each index does keep track of what objects it has indexed, to be able to remove data from the index when the object is removed, for example. At the same time, the catalog keeps track of what objects it has indexed as a whole.

    Thus, we can calculate the difference between these two sets of information. This is what the catalog does all the time when you call the published APIs, but for this trick there is no such public API. We'll need to reach into the catalog internals and grab these sets for ourselves.
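
    To see the idea in isolation, here is a toy version that uses plain Python sets instead of the catalog's own data structures. The function name missing_rids and the record ids are invented for the example.

```python
def missing_rids(catalog_rids, index_rids):
    """Record ids the catalog knows about but the index has no entry for."""
    return set(catalog_rids) - set(index_rids)

# Hypothetical data: the catalog holds five records, the index only three.
catalog_rids = {1, 2, 3, 4, 5}
index_rids = {1, 3, 4}
print(sorted(missing_rids(catalog_rids, index_rids)))  # [2, 5]
```

    Records 2 and 5 are catalogued but have no entry in the index, which is exactly the set we are after.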

    Luckily, these are all BTree sets, and the operations are thus relatively efficient. Here is how I'd do it:

    from BTrees.IIBTree import IISet, difference
    
    def missing_entries_for_index(catalog, index_name):
        # Return the difference between catalog and index ids
        index = catalog._catalog.getIndex(index_name)
        referenced = IISet(index.referencedObjects()) # Works with any UnIndex-based index
        return (
            difference(IISet(catalog._catalog.paths), referenced),
            len(catalog) - len(referenced)
        )
    

    The missing_entries_for_index function returns an IISet of catalog record ids together with its length; each id points to a catalog record for which the named index has no entry. You can then use catalog.getpath to turn an id into the full path of the object, catalog.getMetadataForRID to get a dictionary of metadata values, catalog.getobject to get the original object itself, or catalog._catalog[] to get a catalog brain.

    The following function gives you a catalog result set, just like the one you would get from a regular catalog search:
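
    For instance, a small helper along these lines could turn the ids into paths. The name paths_for_rids is hypothetical, and the sketch assumes nothing beyond the getpath method mentioned above.

```python
def paths_for_rids(catalog, rids):
    """Return the path of every catalog record id in rids.

    paths_for_rids is a hypothetical helper name; catalog is assumed
    to provide getpath(rid), as portal_catalog does.
    """
    return [catalog.getpath(rid) for rid in rids]
```

    Passing it the first element of the tuple returned by missing_entries_for_index lists the paths of all objects the index has no entry for.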

    from Products.ZCatalog.Lazy import LazyMap
    
    def not_indexed_results(catalog, index_name):
        rs, length = missing_entries_for_index(catalog, index_name)
        return LazyMap(catalog._catalog.__getitem__, rs.keys(), length)