I have a text file that I have to read in binary and write back out in binary. No problem. I need to mask out Social Security Numbers with Xs, which is normally pretty easy:
text = re.sub("\\b\d{3}-\d{2}-\d{4}\\b","XXX-XX-XXXX", text)
This is a sample of the text I'm parsing:
more stuff here
CHILDREN�S 001-02-0003 get rid of that
stuff goes here
not001-02-0003
but ssn:001-02-0003
and I need to turn it into this:
more stuff here
CHILDREN�S XXX-XX-XXXX get rid of that
stuff goes here
not001-02-0003
but ssn:XXX-XX-XXXX
Super! So now I'm trying to write that same regex 'in binary'. Here is what I've got, and it 'works', but gosh it doesn't feel right at all:
line = re.sub(b"\\B(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\x00-(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\x00-(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\\B", b"\x00X\x00X\x00X\x00-\x00X\x00X\x00-\x00X\x00X\x00X\x00X", line)
Notes:
Shouldn't my regex use a range of digits instead? I just don't know how to do that in binary. Also, my word boundaries only work backwards, as \B instead of \b. What is up with that?
UPDATE: I've also tried this:
line = re.sub(b"[\x30-\x39]", b"\x58", line)
and that does it for EVERY number, but if I try to even do something simple like:
line = re.sub(b"[\x30-\x39][\x30-\x39]", b"\x58\x58", line)
it doesn't match anything anymore, any idea why?
You might try:
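import re
# Bytes pattern: a file opened with "rb" yields bytes, so the pattern and the replacement must be bytes too
rx = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')
with open("test.txt", "rb") as fr, open("test2.txt", "wb") as fp:
    # Mask every SSN and leave all other bytes untouched
    repl = rx.sub(b'XXX-XX-XXXX', fr.read())
    fp.write(repl)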
This keeps every junk character as it is and writes the result to test2.txt.
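One caveat: the \x00 bytes in the pattern from the question suggest the file may actually be UTF-16, where every ASCII character takes two bytes. I can't confirm that from the sample, but it would explain why [\x30-\x39][\x30-\x39] matches nothing: two digits are never adjacent bytes. Below is a minimal sketch for that case, assuming UTF-16 big-endian data. The names SSN_UTF16BE, MASK and mask_ssns are made up for illustration, and the lookarounds stand in for \b, which only looks at single bytes.

```python
import re

# Assumption: UTF-16 big-endian, so each ASCII character is stored as
# a zero byte followed by the ASCII byte.
SSN_UTF16BE = re.compile(
    rb"(?<!\x00[0-9A-Za-z_])"   # previous character is not a word character
    rb"(?:\x00[0-9]){3}\x00-"   # three digits and a dash
    rb"(?:\x00[0-9]){2}\x00-"   # two digits and a dash
    rb"(?:\x00[0-9]){4}"        # four digits
    rb"(?!\x00[0-9A-Za-z_])"    # next character is not a word character
)
MASK = "XXX-XX-XXXX".encode("utf-16-be")


def mask_ssns(data: bytes) -> bytes:
    """Replace SSN-shaped runs in UTF-16 BE bytes, leaving other bytes alone."""
    return SSN_UTF16BE.sub(MASK, data)


sample = "not001-02-0003\nbut ssn:001-02-0003\n".encode("utf-16-be")
print(re.search(rb"[0-9][0-9]", sample))        # no two digit bytes are adjacent
print(mask_ssns(sample).decode("utf-16-be"))
```

If the file is little-endian instead, the zero byte comes after each character, so the pattern and the encoding name would have to be flipped accordingly.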
Note that, when you don't want to escape every backslash, you can use a raw string, r'string here', in Python.
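To make the raw-string point concrete, here is a small sketch (variable names are only illustrative) comparing a fully escaped pattern with its raw-string spelling, and showing what goes wrong when \b is written in a normal string without doubling the backslash.

```python
import re

escaped = "\\b\\d{3}-\\d{2}-\\d{4}\\b"   # every backslash doubled
raw = r"\b\d{3}-\d{2}-\d{4}\b"           # raw string, no doubling needed

print(escaped == raw)           # both literals spell the same pattern
print(len("\b"), len(r"\b"))    # "\b" is a single backspace character; r"\b" is two characters

text = "not001-02-0003 but ssn:001-02-0003"
masked = re.sub(raw, "XXX-XX-XXXX", text)
print(masked)
```

The same applies to bytes patterns: rb'...' is the raw form of b'...'.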