unicode string in file contain different

My system is fedora. From some reason .The last field of one record is a unicode string (use memcpy copy data from a guest machine in qemu) . The unicode string is windows regedit key name.

smss.exe|NtOpenKey|304|4|4|0|\^@R^@e^@g^@i^@s^@t^@r^@y^@\^@M^@a^@c^@h^@i^@n^@e^@\^@S^@y^@s^@t^@e^@m^@\^@C^@u^@r^@r^@e^@n^@t^@C^@o^@n^@t^@r^@o^@l^@S^@e^@t^@\^@C^@o^@n^@t^@r^@o^@l^@\^@S^@e^@s^@s^@i^@o^@n^@ ^@M^@a^@n^@a^@g^@e^@r^@ smss.exe|NtClose|304|4|4|0|System|NtOpenKey|4|0|2147484532|0|\^@R^@e^@g^@i^@s^@t^@r^@y^@\^@M^@a^@c^@h^@i^@n^@e^@\^@S^@y^@s^@t^@e^@m^@\^@C^@u^@r^@r^@e^@n^@t^@C^@o^@n^@t^@r^@o^@l^@S^@e^@t^@ services.exe|NtOpenKey|680|624|636|0|\^@R^@E^@G^@I^@S^@T^@R^@Y^@\^@M^@A^@C^@H^@I^@N^@E^@\^@S^@y^@s^@t^@e^@m^@\^@C^@u^@r^@r^@e^@n^@t^@C^@o^@n^@t^@r^@o^@l^@S^@e^@t^@\^@S^@e^@r^@v^@i^@c^@e^@s^@

Here is the some of hex code: use '|' as split char. The first 6 fields was ascii sting .The last field is a window unicode string (which I think it is utf-16 code) .

0000000 6d73 7373 652e 6578 4e7c 4f74 6570 4b6e
0000010 7965 337c 3430 347c 347c 307c 5c7c 5200
0000020 6500 6700 6900 7300 7400 7200 7900 5c00
0000030 4d00 6100 6300 6800 6900 6e00 6500 5c00
0000040 5300 7900 7300 7400 6500 6d00 5c00 4300
0000050 7500 7200 7200 6500 6e00 7400 4300 6f00
0000060 6e00 7400 7200 6f00 6c00 5300 6500 7400
0000070 5c00 4300 6f00 6e00 7400 7200 6f00 6c00
0000080 5c00 5300 6500 7300 7300 6900 6f00 6e00
0000090 2000 4d00 6100 6e00 6100 6700 6500 7200

I will use python to parse it and insert it a db . Here is how i handle

def parsecreate(filename):
    sourcefile = codecs.open("data.db",mode="r",encoding='utf-8')
    cx = sqlite3.connect("sqlite.db")
    cu = cx.cursor()
    cu.execute("create table data(id integer primary key,command text, ntfunc text, pid text, ppid text, handle text, roothandle text, genevalue text)")
    eachline = []
    for lines in sourcefile:
        eachline = lines.split('|')
        eachline[-1] = eachline[-1].strip('\n')
        eachline[-1] = eachline[-1].decode('utf-8')

        cu.execute("insert into data(command,ntfunc,pid,ppid,handle,roothandle,genevalue) values(?,?,?,?,?,?,?)",(eachline[0],eachline[1],eachline[2],eachline[3],eachline[4],eachline[5],eachline[-1]) )

    cx.commit()
    cx.close()

I will got wrong :

File "./parse1.py", line 18, in parsecreate for lines in sourcefile: File "/usr/lib/python2.7/codecs.py", line 684, in next return self.reader.next() File "/usr/lib/python2.7/codecs.py", line 615, in next line = self.readline() File "/usr/lib/python2.7/codecs.py", line 530, in readline data = self.read(readsize, firstline=True) File "/usr/lib/python2.7/codecs.py", line 477, in read newchars, decodedbytes = self.decode(data, self.errors) UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 51: invalid continuation byte

Becuase the unicode string may contain a byte the utf8 not know it. How Can read the last field rightly?

Simply to say . There is a unicode string in not a utf-16 encode file, How can make the field rightly be insert into the db? Python read a file use one encoding style. Can I just read origin bytes .Can combine these bytes into a unicode string.

Solution

Your data file is not a text-only file so open the file as binary and decode the text fields explicitly. I had to manipulate the data quite a bit to get back what I think was the original binary data. It looks like the original data may have been a sqlite3.exe dump similar to my final output below, except the data for the final field was stored as a UTF-16-encoded BLOB instead of TEXT.

Note that parsing by lines and splitting by '|' may encounter problems if the UTF-16 data contains the bytes representing '\n' or '|', but I'll ignore that for now.

Here's my test:

from binascii import unhexlify
import sqlite3

data = unhexlify('''\
6d73 7373 652e 6578 4e7c 4f74 6570 4b6e
7965 337c 3430 347c 347c 307c 5c7c 5200
6500 6700 6900 7300 7400 7200 7900 5c00
4d00 6100 6300 6800 6900 6e00 6500 5c00
5300 7900 7300 7400 6500 6d00 5c00 4300
7500 7200 7200 6500 6e00 7400 4300 6f00
6e00 7400 7200 6f00 6c00 5300 6500 7400
5c00 4300 6f00 6e00 7400 7200 6f00 6c00
5c00 5300 6500 7300 7300 6900 6f00 6e00
2000 4d00 6100 6e00 6100 6700 6500 7200'''.replace(' ','').replace('\n',''))

# OP's data dump must have been decoded from the original data
# as little-endian words, and is missing a final 0x00 byte.
# Byte-swapping and adding missing zero byte to get back what
# was likely the original binary data.
data = ''.join(a+b for a,b in zip(data[1::2],data[::2])) + '\x00'

with open('data.db','wb') as f:
    f.write(data)

def parsecreate(filename):
    with open(filename,'rb') as sourcefile:
        with sqlite3.connect("sqlite.db") as cx:
            cu = cx.cursor()
            cu.execute("create table data(id integer primary key,command text, ntfunc text, pid text, ppid text, handle text, roothandle text, genevalue text)")
            eachline = []
            for line in sourcefile:
                eachline = line.split('|')
                eachline[-1] = eachline[-1].decode('utf-16le')
                cu.execute("insert into data(command,ntfunc,pid,ppid,handle,roothandle,genevalue) values(?,?,?,?,?,?,?)",(eachline[0],eachline[1],eachline[2],eachline[3],eachline[4],eachline[5],eachline[-1]) )

parsecreate('data.db')

Output:

C:\>sqlite3 sqlite.db
SQLite version 3.7.9 2011-11-01 00:52:41
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> select * from data;
1|smss.exe|NtOpenKey|304|4|4|0|\Registry\Machine\System\CurrentControlSet\Control\Session Manager