python regex character-encoding python-2.x digits

How to detect double-byte numbers

I have to check strings in Japanese that are encoded in double-byte characters (naturally the files aren't in Unicode and I have to keep them in Shift-JIS). Many of these strings contain digits that are also double byte characters, (１２３４５６７８９) instead of standard single-byte digits (0-9). As such, the usual methods of searching for digits won't work (using [0-9] in regex, or \d for example).

The only way I've found to make it work is to create a tuple and iterate over the tuple in a string to look for a match, but is there a more effective way of doing this?

This is an example of the output I get when searching for double byte numbers:

>>> s = "２34"  # "2" is a double-byte integer
>>> if u"2" in s:
      print "y"

>>> if u"２" in s:
      print "y"

    y
>>> print s[0]

>>> print s[:2]
    ２
>>> print s[:3]
    ２3

Any advice would be greatly appreciated!

Solution

First of all, the comments are right: for the sake of your sanity, you should only ever work with unicode inside your Python code, decoding from Shift-JIS that comes in, and encoding back to Shift-JIS if that's what you need to output:

text = incoming_bytes.decode("shift_jis")
# ... do stuff ...
outgoing_bytes = text.encode("shift_jis")

See: Convert text at the border.

Now that you're doing it right re: unicode and encoded bytestrings, it's straightforward to get either "any digit" or "any double width digit" with a regex:

>>> import re
>>> s = u"２34"
>>> digit = re.compile(r"\d", re.U)
>>> for d in re.findall(digit, s):
...     print d,
... 
２ 3 4
>>> wdigit = re.compile(u"[０-９]+")
>>> for wd in re.findall(wdigit, s):
...     print wd,
... 
２

In case the re.U flag is unfamiliar to you, it's documented here.