How to tell Python that a string is actually a bytes object, without converting it?


I have a txt file which contains a line:

 '        6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'

The content inside the double quotes is actually an octal-escaped byte sequence, but with two escape characters (backslashes) in front of each code.

After the line has been read in, I used regex to extract the contents in the double quotes.

c = re.search(r': "(.+)"', line).group(1)

After that, I have two problems:

First, I need to replace the two escape characters with one.

Second, I need to tell Python that the str object c is actually a bytes object.

I have not managed to do either.

I have tried:

re.sub('\\', '\', line)
re.sub(r'\\', '\', line)
re.sub(r'\\', r'\', line)

All failed.

A bytes object can easily be defined as a literal with the 'b' prefix.

c = b'\351\231\220\346\227\266\345\205\215\350\264\271'

How can I change the type of a variable from str to bytes? I don't think this is an encode-and-decode problem.

I googled a lot but found no answers. Maybe I used the wrong keywords.

Does anyone know how to do this, or another way to get what I want?


Solution

  • This is always a little confusing. I assume your bytes object should represent a string like:

    b = b'\351\231\220\346\227\266\345\205\215\350\264\271'
    b.decode()
    # '限时免费'
    

    To get that with your escaped string, you could use the codecs library and try:

    import re
    import codecs
    
    line =  '        6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
    c = re.search(r': "(.+)"', line).group(1)
    
    codecs.escape_decode(bytes(c, "utf-8"))[0].decode("utf-8")
    # '限时免费'
    

    giving the same result.
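
As far as I know, codecs.escape_decode is not listed in the documented codecs API, so here is a minimal sketch of the same idea using only str.encode and bytes.decode with the standard "unicode_escape" and "latin-1" codecs. The helper name octal_escapes_to_text is hypothetical. It assumes the extracted text is plain ASCII and that the bytes are UTF-8. The str.replace call only matters if the file really contains doubled backslashes; with the sample line above it changes nothing.

```python
import re

# Same sample line as in the question; in source code each "\\" is one backslash.
line = '        6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)


def octal_escapes_to_text(escaped, encoding="utf-8"):
    """Hypothetical helper: decode backslash-octal escapes into readable text."""
    # If the input really holds doubled backslashes, collapse each pair to one.
    single = escaped.replace("\\\\", "\\")
    # unicode_escape turns each octal escape into the code point with that
    # value (0-255); latin-1 maps those code points back to identical bytes.
    raw = single.encode("ascii").decode("unicode_escape").encode("latin-1")
    # Assumption: the resulting bytes are UTF-8 encoded text.
    return raw.decode(encoding)


text = octal_escapes_to_text(c)
print(text)
```

Plain str.replace is used instead of re.sub because a lone backslash is not a valid regex pattern or replacement template, which is probably why the re.sub attempts in the question failed.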