What is the best way to represent a category hierarchy using term prefixes in Xapian?


Assume I have the following example hierarchy:

  • US
    • Michigan
      • Detroit
      • Grand Rapids
      • Lansing
    • Minnesota
      • Grand Rapids
      • Minneapolis
      • St Paul
    • Ohio
      • Columbus
      • Grand Rapids
      • Sandusky

I see two ways that I could index a “Grand Rapids, Michigan” document with prefixed terms:

XFIRSTLEVELus
XSECONDLEVELmichigan
XTHIRDLEVELgrandrapids

or

XFIRSTLEVELus
XSECONDLEVELus_michigan
XTHIRDLEVELus_michigan_grandrapids

I’m inclined to use the second approach, thinking that it will return more intuitive results. That is, a search for Grand Rapids, Michigan is less likely to return documents from Minnesota and Ohio.
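To make the difference between the two schemes concrete, here is a minimal sketch in plain Python (no Xapian calls). The helper names `normalize`, `flat_terms`, and `concatenated_terms` are made up for illustration, and the normalization rule (lowercase, keep only letters and digits) is an assumption chosen to reproduce the terms shown above.

```python
# Hypothetical helpers that only build term strings; nothing here touches Xapian.
LEVEL_PREFIXES = ["XFIRSTLEVEL", "XSECONDLEVEL", "XTHIRDLEVEL"]


def normalize(name):
    """Lowercase a place name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def flat_terms(path):
    """First scheme: each level's term holds only that level's own name."""
    return [prefix + normalize(name) for prefix, name in zip(LEVEL_PREFIXES, path)]


def concatenated_terms(path):
    """Second scheme: each level's term holds the full chain of ancestors."""
    names = [normalize(name) for name in path]
    return [
        prefix + "_".join(names[: depth + 1])
        for depth, prefix in enumerate(LEVEL_PREFIXES[: len(names)])
    ]


michigan = ["US", "Michigan", "Grand Rapids"]
minnesota = ["US", "Minnesota", "Grand Rapids"]

# Terms the two documents have in common under each scheme.
shared_flat = set(flat_terms(michigan)) & set(flat_terms(minnesota))
shared_concat = set(concatenated_terms(michigan)) & set(concatenated_terms(minnesota))

print(sorted(shared_flat))    # the city-level term collides across states
print(sorted(shared_concat))  # only the country-level term is shared
```

Under the first scheme both documents carry `XTHIRDLEVELgrandrapids`, so a city-level match alone cannot tell the states apart. Under the second scheme the city-level terms differ.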

However, two aspects of this approach bother me. First, the creation and maintenance of term prefixes for each level of the hierarchy feels wrong. Second, the concatenation of values seems like a surrogate for using weights.

So, what is the best way to represent a hierarchy with term prefixes?


Solution

  • As with all these things, it might be best to think about how you want to use the data, rather than what the 'best' way of storing it is.

    In the past, I have stored location data like you describe as if they were URL paths, converting each place name into a slug, so your example above would look something like:

    us
    us/michigan
    us/michigan/detroit
    us/michigan/grand-rapids
    us/michigan/lansing
    us/minnesota
    us/minnesota/grand-rapids
    us/minnesota/minneapolis
    us/minnesota/st-paul
    us/ohio
    us/ohio/columbus
    us/ohio/grand-rapids
    us/ohio/sandusky
    

    Give each document a prefixed term containing its path. Then use an exact term search to get only the documents in one specific place (location:us/minnesota/minneapolis), or a wildcard search to get all children of a location (location:us/minnesota/*).

    This may or may not be the 'best' solution, but it might work for some applications :)
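The path scheme from this answer can be sketched in plain Python. This is only an illustration of how the slugs are built and how the exact and wildcard lookups behave. The function names (`slugify`, `location_path`, `matches`) and the document ids are hypothetical, and the matching is simulated with string comparison rather than a real Xapian query.

```python
import re


def slugify(name):
    """Turn a place name into a URL-style slug, e.g. 'St Paul' -> 'st-paul'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def location_path(names):
    """Join the slugs of each hierarchy level into one path string."""
    return "/".join(slugify(name) for name in names)


def matches(query, path):
    """Simulate the two lookups: an exact path, or 'parent/*' for children."""
    if query.endswith("/*"):
        # Keep the trailing slash so 'us/minnesota/*' does not match 'us/minnesota'.
        return path.startswith(query[:-1])
    return path == query


# Hypothetical documents, each indexed with a single location path.
docs = {
    "doc1": location_path(["US", "Michigan", "Grand Rapids"]),
    "doc2": location_path(["US", "Minnesota", "Grand Rapids"]),
    "doc3": location_path(["US", "Minnesota", "St Paul"]),
    "doc4": location_path(["US", "Minnesota"]),
}

exact = sorted(d for d, p in docs.items() if matches("us/minnesota/grand-rapids", p))
children = sorted(d for d, p in docs.items() if matches("us/minnesota/*", p))
print(exact)
print(children)
```

In Xapian itself, a term like this would typically be added with `Document.add_boolean_term`, and the `location:` field mapped to a term prefix with `QueryParser.add_boolean_prefix`. Wildcard expansion in the query parser additionally needs `QueryParser.FLAG_WILDCARD` and a database set on the parser. Whether the parser handles the `/` and `-` characters in the term as intended is something to verify against your Xapian version.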