string character-encoding pattern-matching boyer-moore

Do I have to take the encoding into account when performing Boyer-Moore pattern matching?

I'm about to implement a variation of the Boyer-Moore pattern matching algorithm (the Sunday algorithm to be specific) and I was asking myself: What is my alphabet size?

Does it depend on the encoding (= number of possible characters) or can I just assume my alphabet consists of 256 symbols (= number of symbols which can be represented by a byte)?

In many other situations treating characters as bytes would be a problem because depending on the encoding a character can consist of multiple bytes, but if in my case both strings have the same encoding then equal characters are represented by equal byte sequences, so I would assume it doesn't matter.

So: Do I have to take the encoding into account and assume an alphabet consisting of the actual characters (> 90000 for Unicode) or can I just handle the text and the pattern as a stream of bytes?

Solution

A multi-byte encoding can be used with a byte-oriented search routine IF it is self-synchronizing.

So, you can safely use Boyer-Moore with:

CESU-8
UTF-8
UTF-EBCDIC

But can NOT use it with:

Big5
EUC-JP
GBK / GB18030
ISO 2022
Johab
Punycode
Shift-JIS
UTF-7
UTF-16
UTF-32