python regex binaryfiles python-3.6

Python regex on a binary text file: how to use a range of numbers and a word boundary?


I have a text file that I have to read in binary mode and write back out in binary mode. No problem. I need to mask out Social Security Numbers with Xs, which is normally pretty easy:

text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "XXX-XX-XXXX", text)

This is a sample of the text I'm parsing:

more stuff here CHILDREN�S 001-02-0003 get rid of that stuff goes here not001-02-0003 but ssn:001-02-0003

and I need to turn it into this:

more stuff here CHILDREN�S XXX-XX-XXXX get rid of that stuff goes here not001-02-0003 but ssn:XXX-XX-XXXX

Super! So now I'm trying to write that same regex 'in binary'. Here is what I've got, and it 'works', but gosh, it doesn't feel right at all:

line = re.sub(b"\\B(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\x00-(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\x00-(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\\B", b"\x00X\x00X\x00X\x00-\x00X\x00X\x00-\x00X\x00X\x00X\x00X", line)

Notes:

  • that junk in CHILDRENS, gotta keep it like that
  • I need a word boundary, so that not001-02-0003 (the 4th line of the sample) doesn't get masked out

Shouldn't my regex be a range of numbers instead? I just don't know how to do that in binary. And my word boundaries only work backwards as \B instead of \b, uh.. what is up with that?

UPDATE: I've also tried this:

line = re.sub(b"[\x30-\x39]", b"\x58", line)

and that does it for EVERY number, but if I try to even do something simple like:

line = re.sub(b"[\x30-\x39][\x30-\x39]", b"\x58\x58", line)

it doesn't match anything anymore, any idea why?


Solution

  • You might try:

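    import re
    
    # Bytes pattern (rb'...'), because the file is opened in binary mode
    rx = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')
    
    with open("test.txt", "rb") as fr, open("test2.txt", "wb") as fp:
        # The replacement has to be bytes as well
        repl = rx.sub(b'XXX-XX-XXXX', fr.read())
        fp.write(repl)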
    

    This keeps every junk character exactly as it is and writes the result to test2.txt.
    Note that if you don't want to escape every backslash, you can use a raw string in Python: r'string here', or rb'bytes here' for a bytes pattern.
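
A minimal sketch of the same idea on in-memory bytes, so it can be tried without any files. Two things here are assumptions rather than facts from the question: the junk byte is written as \x92, and the \x00 bytes in the question's own pattern are taken to mean the file is UTF-16 big-endian. That layout would also explain why [\x30-\x39][\x30-\x39] matched nothing: two digits are never adjacent when a \x00 byte sits between them. The names mask_ssns and mask_ssns_utf16be are made up for illustration.

```python
import re

# Case 1: ASCII-compatible single-byte data.
# In a bytes pattern, \d and \b only consider ASCII characters,
# so junk bytes are neither matched nor altered.
SSN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")


def mask_ssns(data: bytes) -> bytes:
    """Mask SSNs in single-byte data; all other bytes come back unchanged."""
    return SSN.sub(b"XXX-XX-XXXX", data)


# Case 2 (assumption): UTF-16 big-endian data, where every ASCII character
# is stored as \x00 followed by the character itself.
_DIGIT = rb"\x00[0-9]"
_DASH = rb"\x00-"
_WORD = rb"\x00[A-Za-z0-9_]"
SSN_UTF16BE = re.compile(
    rb"(?<!" + _WORD + rb")"              # previous character is not a word character
    + rb"(?:" + _DIGIT + rb"){3}" + _DASH
    + rb"(?:" + _DIGIT + rb"){2}" + _DASH
    + rb"(?:" + _DIGIT + rb"){4}"
    + rb"(?!" + _WORD + rb")"             # next character is not a word character
)


def mask_ssns_utf16be(data: bytes) -> bytes:
    """Mask SSNs in UTF-16 big-endian data without decoding it."""
    return SSN_UTF16BE.sub("XXX-XX-XXXX".encode("utf-16-be"), data)


if __name__ == "__main__":
    sample = b"CHILDREN\x92S 001-02-0003 not001-02-0003 but ssn:001-02-0003"
    print(mask_ssns(sample))
```

The lookarounds in the second pattern stand in for \b, because at the byte level \b and \B see the \x00 bytes as non-word characters and no longer line up with real word boundaries. If the data turns out to be little-endian, the \x00 would follow each character instead of preceding it, and the pattern would need to be flipped accordingly.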