Creating dictionary from a list of special characters


I'm working on a small script that maps each character of a string (special characters included) to its index, building a dictionary.

#!/usr/bin/env python
#-*- coding: latin-1 -*-

ln1 = '?0>9<8~7|65"4:3}2{1+_)'
ln2 = "(*&^%$£@!/`'\][=-#¢"

refStr = ln2+ln1

keyDict = {}
for i in range(0,len(refStr)):
    keyDict[refStr[i]] = i


print "-" * 32
print "Originl: ",refStr
print "KeyDict: ", keyDict

# added just to test a few special characters
tsChr = ['£','%','\\','¢']

for k in tsChr:
    if k in keyDict:
        print k, "\t", keyDict[k]
    else: print k, "\t", "not in the dic."

It returns the result like this:

Originl:  (*&^%$£@!/`'\][=-#¢?0>9<8~7|65"4:3}2{1+_)
KeyDict:  {'!': 9, '\xa3': 7, '\xa2': 20, '%': 4, '$': 5, "'": 12, '&': 2, ')': 42, '(': 0, '+': 40, '*': 1, '-': 17, '/': 10, '1': 39, '0': 22, '3': 35, '2': 37, '5': 31, '4': 33, '7': 28, '6': 30, '9': 24, '8': 26, ':': 34, '=': 16, '<': 25, '?': 21, '>': 23, '@': 8, '\xc2': 19, '#': 18, '"': 32, '[': 15, ']': 14, '\\': 13, '_': 41, '^': 3, '`': 11, '{': 38, '}': 36, '|': 29, '~': 27}

This is all good, except that the characters £, ¢ and \ show up as \xa3, \xa2 and \\ respectively. Does anyone know why printing ln1/ln2 works fine but printing the dictionary does not? How can I fix this? Any help is greatly appreciated. Cheers!!


Update 1

I've added extra special characters (# and ¢), and this is what I get after following @Duncan's suggestion:

! 9
? 7
? 20
% 4
$ 5
....
....
8 26
: 34
= 16
< 25
? 21
> 23
@ 8
? 19
....
....

Notice that the 7th, 19th and 20th elements are not printing correctly at all; the 21st element is the actual ? character. Cheers!!


Update 2

Just added this loop to my original post to actually test my purpose:

tsChr = ['£','%','\\','¢']
for k in tsChr:
    if k in keyDict:
        print k, "\t", keyDict[k]
    else: print k, "\t", "not in the dic."

and this is what I get as a result:

£   not in the dic.
%   4
\   13
¢   not in the dic.

Whilst running the script, it thinks that £ and ¢ are not actually in the dictionary, and that's my problem. Does anyone know how to fix that, or what I am doing wrong and where?

Eventually, I'll be checking whether character(s) from a file (or a line of text) exist in the dictionary, and there is a chance the text contains characters like é or £. Cheers!!
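
One possible shape for that eventual check, sketched under the assumption that the file's encoding is known (UTF-8 here) and that the text is decoded to Unicode before any lookup. The helper name `chars_not_in_dict` and the temporary sample file are hypothetical, introduced only for this illustration; £ and ¢ are written as \u escapes so the sketch does not depend on how it is saved.

```python
import io
import os
import tempfile

# Same reference characters as the script, as Unicode text.
ref_str = u"(*&^%$\u00a3@!/`'\\][=-#\u00a2" + u'?0>9<8~7|65"4:3}2{1+_)'
key_dict = {ch: idx for idx, ch in enumerate(ref_str)}


def chars_not_in_dict(path, encoding="utf-8"):
    """Return the non-whitespace characters in the file that are not keys of key_dict."""
    # io.open decodes the bytes on disk into Unicode characters while reading.
    with io.open(path, "r", encoding=encoding) as handle:
        text = handle.read()
    return [ch for ch in text if not ch.isspace() and ch not in key_dict]


# Write a small sample file containing £, 5 and é, then check it.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(u"\u00a3 5 \u00e9")

missing = chars_not_in_dict(sample_path)
os.remove(sample_path)

print(missing == [u"\u00e9"])  # True: only é is absent from the dictionary
```

If the file is in a different encoding, passing that name as `encoding` is the only change this sketch should need.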


Solution

  • In my humble opinion it would be useful to learn about Unicode in general and its use in Python.

    If you are not interested in why people had to mess things up so that you have to deal with a '\xa3' instead of a plain £, then Duncan's answer above is perfect and tells you everything you want to know.

    Update (regarding your Update #2)

    Please make sure your file is saved with latin-1 encoding and not UTF-8, as it is now, and your test will pass (or change #-*- coding: latin-1 -*- to #-*- coding: utf-8 -*- and use unicode literals, as in the P.S. below).

    This is something you could easily work out after reading (and understanding) the content at my link above:

    Your file is saved as UTF-8, which means two bytes are used for the character £; but since you tell the Python interpreter the encoding is latin-1, it treats each of the two UTF-8 bytes of £ as a separate character, and each becomes its own key.

    In fact, I count 19 characters in ln2, but len(ln2) returns 21.

    When you test for '£' in keyDict.keys() you are looking for a 2-char string, while each of those 2 chars got its own key in the dictionary; that's why it won't find it.

    You can also check len(keyDict) and find it's larger than you expect.

    I guess this explains everything. Please understand that the whole story cannot easily be explained on a single web page, but the link above is, in my humble opinion, a nice starting point, mixing some history and some coding examples.

    Cheers

    P.S.: I'm using this code, saving it as UTF-8 and it works flawlessly:

    #!/usr/bin/env python
    #-*- coding: utf-8 -*-
    
    ln1 = u'?0>9<8~7|65"4:3}2{1+_)'
    ln2 = u"(*&^%$£@!/`'\][=-#¢"
    
    refStr = u"%s%s" % (ln2, ln1)
    
    keyDict = {}
    for idx, chr_ in enumerate(refStr):
        print chr_,
        keyDict[chr_] = idx
    
    print u"-" * 32
    print u"Originl: ", refStr
    print u"KeyDict: ", keyDict
    
    tsChr = [u'£', u'%', u'\\', u'¢']
    for k in tsChr:
        if k in keyDict:
            print k, "\t", keyDict[k]
        else:
            print k, repr(k), "\t", "not in the dic."
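
The byte-versus-character mismatch described above can be reproduced without depending on how the source file is saved. The following is a minimal sketch, not the original script: it writes £ and ¢ as \u escapes, and the names `ref_bytes`, `byte_keys` and `char_keys` are introduced here purely for illustration. It should behave the same on Python 2.7 and Python 3.

```python
from __future__ import print_function

pound = u"\u00a3"  # the character £
ln2 = u"(*&^%$\u00a3@!/`'\\][=-#\u00a2"  # same 19 characters as ln2 in the question

# What ends up on disk when the file is saved as UTF-8.
ref_bytes = ln2.encode("utf-8")

print(len(ln2))        # 19 characters
print(len(ref_bytes))  # 21 bytes: £ and ¢ take two bytes each

# Keying on single bytes, which is what the original byte-string loop does.
byte_keys = {ref_bytes[i:i + 1]: i for i in range(len(ref_bytes))}

# Keying on decoded characters, which is what the u'' version does.
char_keys = {ch: i for i, ch in enumerate(ln2)}

print(pound.encode("utf-8") in byte_keys)  # False: a 2-byte string is never a key
print(b"\xa3" in byte_keys)                # True: only the individual bytes are keys
print(pound in char_keys)                  # True
print(char_keys[pound])                    # 6
```

Keying on decoded characters, as the P.S. code does with its u'' literals, is what makes the lookup for £ succeed.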