Tags: python, python-2.7, python-unicode, mojibake

Identify garbage unicode string using python


My script reads data from a CSV file; the file can contain strings of English or non-English words.

Sometimes the file contains garbage strings. I want to identify those strings, skip them, and process the others.

import codecs
import csv

def is_valid_unicode_str(value):
    try:
        # what check should go here?
        return True
    except UnicodeEncodeError:
        return False

doc = codecs.open(input_text_file, "rb", 'utf_8_sig')
fob = csv.DictReader(doc)
for entry in fob:
    if is_valid_unicode_str(entry['Name']):
        process_further(entry)

csv input:

"Name"
"袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥Â­Âå•â€"
"元大寶來證券"
"John Dove"

I want to define the function is_valid_unicode_str() so that it identifies garbage strings and only valid ones get processed.

I tried using decode, but it does not fail when decoding garbage strings:

value.decode('utf8')

The expected output is that both the Chinese and the English strings are processed.

Could you please guide me on how to implement a function that filters out invalid Unicode strings?


Solution

  • You have Mojibake strings: text encoded with one (correct) codec, then decoded as another.

    In this case, your text was decoded with the Windows-1252 codepage; the U+20AC EURO SIGN in the text is typical of CP-1252 mojibake. The original encoding could be one of the GB* family of Chinese encodings, or a multiple-roundtrip UTF-8 / CP-1252 mojibake. Which one, I cannot determine: I cannot read Chinese, nor do I have your full data; CP-1252 mojibake often includes unprintable characters such as the 0x81 and 0x8D bytes, which may have been lost when you posted your question here.
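    To see the failure mode concretely, here is a small sketch (using the legitimate Chinese sample from the question's CSV) of how correct UTF-8 bytes turn into this kind of garbage when decoded with the wrong codec:

    ```python
    # Mojibake in a nutshell: correct UTF-8 bytes, wrong decoding codec.
    original = u"元大寶來證券"               # valid Chinese text from the CSV sample
    utf8_bytes = original.encode("utf-8")    # the correct byte representation
    mojibake = utf8_bytes.decode("cp1252")   # mis-decoded as Windows-1252
    print(mojibake)                          # garbled Latin punctuation and accents

    # Reversing the mistaken step recovers the original text:
    assert mojibake.encode("cp1252").decode("utf-8") == original
    ```

    This particular sample round-trips cleanly because none of its UTF-8 bytes fall on the five byte values CP-1252 leaves undefined; your real data may not be so lucky, which is where ftfy's forgiving codec (below) helps.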

    I'd install the ftfy project; it won't fix GB* encodings (I've requested that the project add support), but it includes a codec called sloppy-windows-1252 that will let you reverse an erroneous decode with that codec:

    >>> import ftfy  # registers extra codecs on import
    >>> text = u'袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥Â­Âå•â€'
    >>> print text.encode('sloppy-windows-1252').decode('gb2312', 'replace')
    猫垄�姑�⑩dcx�盲赂沤忙��姑ヂ�姑ぢ宦�р�得ヂ�氓�⑩�
    >>> print text.encode('sloppy-windows-1252').decode('gbk', 'replace')
    猫垄鈥姑�⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦�р�得ヂ�氓鈥⑩�
    >>> print text.encode('sloppy-windows-1252').decode('gb18030', 'replace')
    猫垄鈥姑⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р�得ヂ氓鈥⑩�
    >>> print text.encode('sloppy-windows-1252').decode('utf8', 'ignore').encode('sloppy-windows-1252').decode('utf8', 'replace')
    袋�dcx与朋�们���
    

    The U+FFFD REPLACEMENT CHARACTER shows the decoding wasn't entirely successful, but that could be because your copied string is missing anything unprintable or anything using the 0x81 or 0x8D bytes.

    You can try to fix your data this way: take the file data, encode it to sloppy-windows-1252, then try decoding with each of the GB* codecs, or roundtrip from UTF-8 twice, and see which result fits best.
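    That trial-and-error loop could be sketched as follows. Note two assumptions: I use the strict built-in cp1252 codec as a stand-in for ftfy's more forgiving sloppy-windows-1252, and candidate_repairs is just a name I've picked for the helper:

    ```python
    def candidate_repairs(text):
        """Re-encode with CP-1252, then try each plausible source codec.

        Uses the strict built-in 'cp1252' codec; with ftfy installed you
        could substitute 'sloppy-windows-1252' to also handle the byte
        values that CP-1252 leaves undefined.
        """
        raw = text.encode("cp1252", "replace")
        return {codec: raw.decode(codec, "replace")
                for codec in ("gb2312", "gbk", "gb18030", "utf-8")}

    # Inspect each candidate; fewer U+FFFD marks usually means a better fit.
    mojibake = u"元大寶來證券".encode("utf-8").decode("cp1252")
    for codec, repaired in candidate_repairs(mojibake).items():
        print(codec, repaired.count(u"\ufffd"), repaired)
    ```

    For this constructed sample the "utf-8" candidate recovers the original text exactly; for real GB*-sourced mojibake one of the GB* candidates should come out cleanest.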

    If that's not good enough (you cannot fix the data), you can use the ftfy.badness.sequence_weirdness() function (available in older ftfy releases; it was removed in ftfy 6.0) to try to detect the issue:

    >>> from ftfy.badness import sequence_weirdness
    >>> sequence_weirdness(text)
    9
    >>> sequence_weirdness(u'元大寶來證券')
    0
    >>> sequence_weirdness(u'John Dove')
    0
    

    Mojibake scores high on the sequence-weirdness scale. You could try to find an appropriate threshold for your data above which you'd consider the data most likely corrupted.

    However, I think we can use a non-zero return value as a starting point for another test. English text should score 0 on that scale, and so should Chinese text. Chinese mixed with English can still score over 0, but in that case you could not encode the Chinese text with the CP-1252 codec, while you can with the broken text:

    from ftfy.badness import sequence_weirdness
    
    def is_valid_unicode_str(text):
        if not sequence_weirdness(text):
            # nothing weird, should be okay
            return True
        try:
            text.encode('sloppy-windows-1252')
        except UnicodeEncodeError:
            # Not CP-1252 encodable, probably fine
            return True
        else:
            # Encodable as CP-1252, Mojibake alert level high
            return False
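    If pulling in ftfy isn't an option, the CP-1252-encodability half of that test can stand on its own as a rough heuristic. This is my own sketch, not part of ftfy, and it only catches the UTF-8-mis-decoded-as-CP-1252 flavour of mojibake:

    ```python
    def looks_like_cp1252_mojibake(text):
        """Rough stand-alone check for UTF-8 text mis-decoded as CP-1252.

        Genuine Chinese (or other non-Latin) text cannot be encoded as
        CP-1252 at all, while this flavour of mojibake both encodes
        cleanly AND yields valid UTF-8 multi-byte sequences underneath.
        """
        try:
            raw = text.encode("cp1252")
        except UnicodeEncodeError:
            return False              # real non-Latin text, e.g. Chinese
        try:
            repaired = raw.decode("utf-8")
        except UnicodeDecodeError:
            return False              # CP-1252-encodable, but not UTF-8 underneath
        return repaired != text       # plain ASCII round-trips unchanged

    print(looks_like_cp1252_mojibake(u"John Dove"))    # False
    print(looks_like_cp1252_mojibake(u"元大寶來證券"))  # False
    print(looks_like_cp1252_mojibake(
        u"元大寶來證券".encode("utf-8").decode("cp1252")))  # True
    ```

    It is deliberately conservative: accented Western text like "café" encodes as CP-1252 but its bytes are not valid UTF-8, so it is (correctly) left alone.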