Search code examples
pythonregexcharacter-encodingpython-2.xdigits

How to detect double-byte numbers


I have to check strings in Japanese that are encoded in double-byte characters (naturally the files aren't in Unicode and I have to keep them in Shift-JIS). Many of these strings contain digits that are also double byte characters, (123456789) instead of standard single-byte digits (0-9). As such, the usual methods of searching for digits won't work (using [0-9] in regex, or \d for example).

The only way I've found to make it work is to create a tuple and iterate over the tuple in a string to look for a match, but is there a more effective way of doing this?

This is an example of the output I get when searching for double byte numbers:

>>> s = "234"  # "2" is a double-byte integer
>>> if u"2" in s:
      print "y"

>>> if u"2" in s:
      print "y"

    y
>>> print s[0]

>>> print s[:2]
    2
>>> print s[:3]
    23

Any advice would be greatly appreciated!


Solution

  • First of all, the comments are right: for the sake of your sanity, you should only ever work with unicode inside your Python code, decoding from Shift-JIS that comes in, and encoding back to Shift-JIS if that's what you need to output:

    text = incoming_bytes.decode("shift_jis")
    # ... do stuff ...
    outgoing_bytes = text.encode("shift_jis")
    

    See: Convert text at the border.

    Now that you're doing it right re: unicode and encoded bytestrings, it's straightforward to get either "any digit" or "any double width digit" with a regex:

    >>> import re
    >>> s = u"234"
    >>> digit = re.compile(r"\d", re.U)
    >>> for d in re.findall(digit, s):
    ...     print d,
    ... 
    2 3 4
    >>> wdigit = re.compile(u"[0-9]+")
    >>> for wd in re.findall(wdigit, s):
    ...     print wd,
    ... 
    2
    

    In case the re.U flag is unfamiliar to you, it's documented here.