Tags: python, ascii, frequency-analysis, python-3.7

Interpreting Python output for UTF-8 characters


I have a .txt file and need to count the frequency of every character in it, in order to do a simple frequency analysis for my cryptology exercise.

I think the code works, but Python seems to have trouble reading characters such as Ä, Ö, ß (German alphabet). Since the code reads a .txt file, I assume the file is UTF-8 encoded.

This is the output:

Counter({' ': 168, 'S': 136, '\xc3': 103, 'Z': 83, 'G': 80, 'P': 80,
'W': 76, 'J': 66, 'O': 63, 'Q': 62, 'R': 57, 'U': 57, 'L': 47, '\x84': 43,
'K': 39, '\x9c': 28, 'X': 25, 'A': 23, 'C': 22, '\x9f': 18, 'E': 17, 'N':
17, '\x96': 14, ',': 11, 'D': 8, 'Y': 8, 'T': 6, 'V': 6, 'B': 5, '"': 4,
"'": 3, 'F': 2, 'M': 2, '!': 1, '-': 1, '?': 1}) [Finished in 0.1s]

My question is how to interpret the escaped values such as '\xc3'. I can't find anything online on how to translate them.

Edit (my code):

from collections import Counter
with open('/Users/StB/Downloads/text.txt') as f:
    c = Counter()
    for x in f:
        c += Counter(x.strip())
print c

Edit 2:

New output:

Counter({' ': 168, 'S': 136, 'Z': 83, 'P': 80, 'G': 80, 'W': 76, 'J': 66, 'O': 63, 'Q': 62, 'R': 57, 'U': 57, 'L': 47, 'Ä': 43, 'K': 39, 'Ü': 28, 'X': 25, 'A': 23, 'C': 22, 'ß': 18, 'N': 17, 'E': 17, 'Ö': 14, ',': 11, 'Y': 8, 'D': 8, 'T': 6, 'V': 6, 'B': 5, '"': 4, "'": 3, 'F': 2, 'M': 2, '-': 1, '!': 1, '?': 1})

New code:

from collections import Counter
with open('/Users/StB/Downloads/text.txt', encoding='utf-8') as f:
    c = Counter()
    for x in f:
        c += Counter(x.strip())
print(c)

The encoding argument does not work with the Python version I had running in Sublime Text. It worked fine in IDLE, though!
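A likely reason for the difference is that the editor build was running Python 2, where the built-in open has no encoding keyword (the `print c` statement and the byte escapes in the first output point that way, but this is an assumption). As a hedged sketch, io.open accepts encoding on both Python 2.6+ and Python 3, so the same script can run under either. The helper name count_chars and the temporary sample file are made up for illustration.

```python
import io
import os
import tempfile
from collections import Counter


def count_chars(path):
    """Count characters (not bytes) in a UTF-8 encoded text file.

    count_chars is a hypothetical helper name, not part of the original code.
    """
    counts = Counter()
    # io.open takes an encoding argument on Python 2.6+ and on Python 3.
    with io.open(path, encoding='utf-8') as f:
        for line in f:
            counts.update(line.strip())
    return counts


# Demo on a throwaway file so the sketch is self-contained.
sample = u'\u00c4\u00d6\u00dc \u00df\u00c4'  # the text "ÄÖÜ ßÄ"
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    result = count_chars(path)
finally:
    os.remove(path)

print(result)
```

With the real file, the call would be count_chars('/Users/StB/Downloads/text.txt').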


Solution

  • On Python 2, you need to explicitly decode the bytes you read into Unicode. You can also use the Counter.update method to avoid creating and discarding a temporary Counter object for every line.

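    from collections import Counter
    with open('/Users/StB/Downloads/text.txt') as f:
        c = Counter()
        for x in f:
            # Python 2: x is a byte string. Decode it first so that a
            # multi-byte character such as Ä is counted as one character.
            c.update(x.decode('utf-8').strip())
    print c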
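As for reading the escapes themselves: '\xc3', '\x84' and so on are single bytes written in hexadecimal. In UTF-8, each of Ä, Ö, Ü and ß is encoded as two bytes, and all four start with the lead byte 0xC3. That is why '\xc3' shows up 103 times in the first output, which equals 43 + 28 + 18 + 14. A minimal sketch (Python 3) that pairs the lead byte with each second byte from the question's output and decodes it:

```python
from collections import Counter

# Byte counts copied from the first output in the question.
byte_counts = {
    b'\xc3': 103,  # shared lead byte
    b'\x84': 43,
    b'\x9c': 28,
    b'\x9f': 18,
    b'\x96': 14,
}

lead = b'\xc3'
decoded = Counter()
for second, n in byte_counts.items():
    if second == lead:
        continue
    # Lead byte + continuation byte form one UTF-8 encoded character.
    decoded[(lead + second).decode('utf-8')] = n

print(decoded)
# The lead byte count should equal the sum of the four characters.
print(sum(decoded.values()) == byte_counts[lead])
```

The decoded counts match the second output in the question: Ä 43, Ü 28, ß 18, Ö 14.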