Search code examples
stringcharacter-encodingpattern-matchingboyer-moore

Do I have to take the encoding into account when performing Boyer-Moore pattern matching?


I'm about to implement a variation of the Boyer-Moore pattern matching algorithm (the Sunday algorithm to be specific) and I was asking myself: What is my alphabet size?

Does it depend on the encoding (= number of possible characters) or can I just assume my alphabet consists of 256 symbols (= number of symbols which can be represented by a byte)?

In many other situations treating characters as bytes would be a problem because depending on the encoding a character can consist of multiple bytes, but if in my case both strings have the same encoding then equal characters are represented by equal byte sequences, so I would assume it doesn't matter.

So: Do I have to take the encoding into account and assume an alphabet consisting of the actual characters (> 90000 for Unicode) or can I just handle the text and the pattern as a stream of bytes?


Solution

  • A multi-byte encoding can be used with a byte-oriented search routine IF it is self-synchronizing.

    So, you can safely use Boyer-Moore with:

    • CESU-8
    • UTF-8
    • UTF-EBCDIC

    But can NOT use it with:

    • Big5
    • EUC-JP
    • GBK / GB18030
    • ISO 2022
    • Johab
    • Punycode
    • Shift-JIS
    • UTF-7
    • UTF-16
    • UTF-32