Search code examples
c++boostcase-insensitiveboyer-moore

Example class for Boost boyer_moore search corpusIter type?


I've successfully used:

boost::algorithm::boyer_moore_search<const char *,const char *>( haystack, haystack_end, needle, needle_end )

to look for a needle in a haystack. Now I'd like to use BM_search to do a case-insensitive search for the needle. Since my haystack is giant, my plan is to convert the needle to lower case and have the haystack iterator treat the haystack characters as a special class whose compare function converts alphabetics to lower case before comparing. However, I haven't been able to express this correctly. I'm trying:

class casechar {
    public:
      char ch;
      // other stuff...it's not right, but I don't think the compiler is even getting this far
} ;

class caseiter : public std::iterator<random_access_iterator_tag,casechar> {
      const casechar *ptr;
    public:
      // various iterator functions, but not enough of them, apparently!
} ;

boost::algorithm::boyer_moore_search<const caseiter,const char *>( HaYsTaCk, HaYsTaCk_EnD, needle, needle_end );

The compiler (g++ on OSX) is complaining about an attempt to instantiate hash<casechar>, I guess for some BM internal thing. I'm lost in a maze of template<twisty_passages,all_different>. Could I impose on someone for a bit of direction? I suspect I just need to provide certain implementations in casechar and/or caseiter, but I don't know which ones.

Thanks!


Solution

  • The first problem you will run into is this:

    BOOST_STATIC_ASSERT (( boost::is_same<
                           typename std::iterator_traits<patIter>::value_type, 
                           typename std::iterator_traits<corpusIter>::value_type>::value ));
    

    This requires the value types of the iterator for the pattern and the iterator for the corpus to be the same type. In other words, you'll need to use casechar for the pattern as well.

    Here's what I'd do:

    • Write a casechar, with custom operator == etc for case-insensitive comparison.
    • No need to write a custom iterator. const casechar * is a perfectly acceptable random access iterator.
    • Write a std::hash<casechar> specialization. This specialization should probably simply return something like std::hash<char>()(std::tolower(ch)).

    That said, I'm somewhat doubtful that this will actually net you a performance gain, compared to just converting everything to lower case. The skip table for chars uses an optimization that uses arrays instead of unordered_maps for faster indexing and fewer heap allocations. This optimization is not used for a custom type such as casechar.