Shove giving KeyError when assigning value


I am using shove to avoid loading a huge dictionary into memory.

import sys
import string

from shove import Shove

lemmaDict = Shove('file://storage')
with open(str(sys.argv[1])) as lemmaCPT:
    for line in lemmaCPT:
        line = line.rstrip('\n')
        # fields: source phrase ||| target phrase ||| scores
        lineAr = string.split(line, ' ||| ')
        lineKey = lineAr[0] + ' ||| ' + lineAr[1]
        lineValue = lineAr[2]
        print lineValue
        lemmaDict[lineKey] = lineValue

However, I'm getting the following KeyError and Traceback partway through reading lemmaCPT. What's going on?

Traceback (most recent call last):
  File "./stemmer.py", line 19, in <module>
    lemmaDict[lineKey] = lineValue
  File "/opt/Python-2.7.6/lib/python2.7/site-packages/shove/core.py", line 44, in __setitem__
    self.sync()
  File "/opt/Python-2.7.6/lib/python2.7/site-packages/shove/core.py", line 74, in sync
    self._store.update(self._buffer)
  File "/opt/Python-2.7.6/lib/python2.7/_abcoll.py", line 542, in update
    self[key] = other[key]
  File "/opt/Python-2.7.6/lib/python2.7/site-packages/shove/base.py", line 123, in __setitem__
    raise KeyError(key)
KeyError: '! ! ! \xd1\x87\xd0\xb8\xd1\x82\xd0\xb0\xd0\xb5\xd1\x82\xd1\x81\xd1\x8f \xd1\x82\xd1\x80\xd0\xbe\xd0\xb5\xd0\xba\xd1\x80\xd0\xb0\xd1\x82\xd0\xbd\xd1\x8b\xd0\xbc \xd0\xbf\xd0\xbe\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5\xd0\xbc \xd0\xbb\xd1\x8e\xd0\xb1\xd0\xbe\xd0\xb3\xd0\xbe ||| ! ! ! is pronounced by'

Sample input:

! ! ! читается троекратным повторением ||| ! ! ! is pronounced by repeating ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| ! ! ! is pronounced by ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| ! ! ! is pronounced ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| ! ! ! is ||| 0.00819374 8.53148e-39 0.00989281 0.0128612
! ! ! читается троекратным повторением ||| ! ! ! ||| 0.000119622 8.53148e-39 0.0098932 0.590703
! ! ! читается троекратным повторением ||| , ! ! ! is pronounced by ||| 0.00819374 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| , ! ! ! is pronounced ||| 0.00819374 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| , ! ! ! is ||| 0.00819374 8.53148e-39 0.00989281 0.00154241
! ! ! читается троекратным повторением ||| , ! ! ! ||| 0.0074488 8.53148e-39 0.00989281 0.070842
! ! ! читается троекратным повторением любого ||| ! ! ! is pronounced by repeating ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением любого ||| ! ! ! is pronounced by ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39

Running ./stemmer.py sampleinput on this input reproduces the KeyError and traceback shown above.


Solution

  • Well, if this is the actual input, then the problem is the length of lemmaDict compared with the length of the input...

    aftnix@dev:~⟫ cat input | wc -l
    11
    

    My changed code....

    from shove import Shove
    import sys
    import string
    
    lemmaDict = Shove('file://storage')
    i = 0
    with open(str(sys.argv[1])) as lemmaCPT:
        for line in lemmaCPT:
            line = line.rstrip('\n')
            lineAr = string.split(line, ' ||| ')
            lineKey = lineAr[0] + ' ||| ' + lineAr[1]
            lineValue = lineAr[2]
            print lineValue
            print len(lemmaDict)
            #print len(lemmaCPT)
            i+=1
            print i
            #lemmaDict[lineKey] = lineValue
    

    Gives the following output...

    0.00744887 8.53148e-39 0.00989281 8.53148e-39
    9
    1
    0.00744887 8.53148e-39 0.00989281 8.53148e-39
    9
    2
    0.00744887 8.53148e-39 0.00989281 8.53148e-39
    9
    3
    0.00819374 8.53148e-39 0.00989281 0.0128612
    9
    4
    0.000119622 8.53148e-39 0.0098932 0.590703
    9
    5
    0.00819374 8.53148e-39 0.00989281 8.53148e-39
    9
    6
    0.00819374 8.53148e-39 0.00989281 8.53148e-39
    9
    7
    0.00819374 8.53148e-39 0.00989281 0.00154241
    9
    8
    0.0074488 8.53148e-39 0.00989281 0.070842
    9
    9
    0.00744887 8.53148e-39 0.00989281 8.53148e-39
    9
    10
    0.00744887 8.53148e-39 0.00989281 8.53148e-39
    9
    

    So you are simply overrunning the dict.

    If you delete two lines from the input, it stops throwing the exception.

    I don't know much about Shove, but a quick check in the shell shows len(lemmaDict) staying at 9 no matter how many lines have been read. There has to be a way to grow it, maybe a method or something like it; you should dig into its docs more closely.

    I just have a feeling you're using Shove the wrong way.

    EDIT: It's kind of bizarre... after reviewing the Shove code, it turns out it should have synced its in-memory contents to the store once the buffer limit was reached...

    def __setitem__(self, key, value):
        self._cache[key] = self._buffer[key] = value
        # when buffer reaches self._limit, write buffer to store
        if len(self._buffer) >= self._sync:
            self.sync()
    

    EDIT 2

    Well, I was totally wrong in my earlier point... but I've found an interesting pointer. One of the problems is that Shove raised a confusing exception...

    The real exception was raised here, in the __setitem__ of shove/base.py that the traceback points to...

    def __setitem__(self, key, value):
        # (per Larry Meyn)
        try:
            with open(self._key_to_file(key), 'wb') as item:
                item.write(self.dumps(value))
        except (IOError, OSError):
            raise KeyError(key)
    

    So the exception actually came from the open() call, which means Shove had trouble writing the file; the KeyError only masks the underlying IOError/OSError. My new suspicion is the length of the key string...
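To see the error that the KeyError hides, here is a minimal sketch (Python 3, not Shove's own code; the helper name try_create is made up for illustration). It attempts the same kind of open() on a short and on an over-long file name in a temporary directory and reports the errno of whatever open() raises:

```python
import errno
import os
import tempfile


def try_create(name_length):
    """Try to create a file whose name has name_length characters.

    Returns None on success, or the errno of the OSError raised by open().
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "k" * name_length)
        try:
            with open(path, "wb") as item:
                item.write(b"value")
        except OSError as exc:  # IOError is an alias of OSError in Python 3
            return exc.errno
    return None


if __name__ == "__main__":
    for n in (50, 300):
        code = try_create(n)
        label = "ok" if code is None else errno.errorcode.get(code, str(code))
        print(n, label)
```

On a typical Linux filesystem the 300-character name should fail with a "file name too long" error, which is exactly the kind of OSError that the except clause above turns into a KeyError.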

    Here is what the storage folder looks like...

    aftnix@dev:~⟫ ls -l storage/
    total 36
    -rw-rw-r-- 1 aftnix aftnix 49 ডিসে   4 01:35 %21+%21+%21+%D1%87%D0%B8%D1%82%D0%B0%D0%B5%D1%82%D1%81%D1%8F+%D1%82%D1%80%D0%BE%D0%B5%D0%BA%D1%80%D0%B0%D1%82%D0%BD%D1%8B%D0%BC+%D0%BF%D0%BE%D0%B2%D1%82%D0%BE%D1%80%D0%B5%D0%BD%D0%B8%D0%B5%D0%BC+%7C%7C%7C+%21+%21+%21
    -rw-rw-r-- 1 aftnix aftnix 52 ডিসে   4 01:35 %21+%21+%21+%D1%87%D0%B8%D1%82%D0%B0%D0%B5%D1%82%D1%81%D1%8F+%D1%82%D1%80%D0%BE%D0%B5%D0%BA%D1%80%D0%B0%D1%82%D0%BD%D1%8B%D0%BC+%D0%BF%D0%BE%D0%B2%D1%82%D0%BE%D1%80%D0%B5%D0%BD%D0%B8%D0%B5%D0%BC+%7C%7C%7C+%2C+%21+%21+%21+is+pronounced
    

    So Shove uses the key, URL-encoded, as the file name. That gets ugly because the keys in the last two input lines are very long, especially the penultimate one: once encoded, every Cyrillic letter takes six characters (for example %D1%87) and every ! or | takes three. As a test, I deleted some characters from the last two lines of the input, and the code ran as expected without any exception.
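A rough way to check this is to measure the encoded length of a key that works and of one that fails. The sketch below (Python 3) assumes the file name is simply the URL-quoted key, as the listing above suggests. The standard library's quote_plus produces names of the same shape, but whether Shove calls exactly that function is an assumption, and encoded_name_length is a made-up helper:

```python
from urllib.parse import quote_plus

NAME_MAX = 255  # usual Linux limit for a single file name


def encoded_name_length(key):
    """Length of the key after URL-quoting.

    Spaces become '+'; non-ASCII bytes and punctuation such as '!' and '|'
    become %XX escapes, so they take three characters per byte.
    """
    return len(quote_plus(key))


# Keys built from the first and the last line of the sample input.
ok_key = ("! ! ! читается троекратным повторением ||| "
          "! ! ! is pronounced by repeating")
bad_key = ("! ! ! читается троекратным повторением любого ||| "
           "! ! ! is pronounced by")

for key in (ok_key, bad_key):
    n = encoded_name_length(key)
    print(n, "fits" if n <= NAME_MAX else "too long")
```

The second key is the one from the KeyError in the question; the extra Cyrillic word is enough to push its encoded form past 255 characters.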

    The Linux kernel limits the length of a file name (see NAME_MAX below)...

    aftnix@dev:~⟫ cat /usr/include/linux/limits.h 
    #ifndef _LINUX_LIMITS_H
    #define _LINUX_LIMITS_H
    
    #define NR_OPEN         1024
    
    #define NGROUPS_MAX    65536    /* supplemental group IDs are available */
    #define ARG_MAX       131072    /* # bytes of args + environ for exec() */
    #define LINK_MAX         127    /* # links a file may have */
    #define MAX_CANON        255    /* size of the canonical input queue */
    #define MAX_INPUT        255    /* size of the type-ahead buffer */
    #define NAME_MAX         255    /* # chars in a file name */
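
The header only gives the compile-time constant. The limit that applies to a particular directory can also be queried at run time. A small sketch using os.pathconf, which is available on Unix only (the function name name_max_for is just for illustration):

```python
import os


def name_max_for(directory="."):
    """Return the maximum file name length allowed in directory (Unix only)."""
    return os.pathconf(directory, "PC_NAME_MAX")


if __name__ == "__main__":
    print(name_max_for("."))
```

Running it against the storage directory tells you how long an encoded key may get before open() starts failing there.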
    

    So to get around it, you have to transform the key somehow. You can't put the raw parsed key into a file-backed Shove.
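
One possible workaround, sketched here under assumptions rather than taken from Shove's documentation, is to hash each key down to a fixed-length name before storing it. In this Python 3 sketch a plain dict stands in for the Shove object, and short_key is a hypothetical helper. In Python 2 the key read from the file is already a byte string, so the encode step would not be needed.

```python
import hashlib


def short_key(key):
    """Return a 64-character hex digest that is safe to use as a file name."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


store = {}  # stand-in for Shove('file://storage'); an assumption of this sketch

line = ("! ! ! читается троекратным повторением любого ||| "
        "! ! ! is pronounced by ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39")
source, target, scores = line.split(" ||| ")
line_key = source + " ||| " + target

# Keep the original key next to the value, because the hash cannot be reversed.
store[short_key(line_key)] = (line_key, scores)

print(len(short_key(line_key)), store[short_key(line_key)][1])
```

The trade-off is that iterating over the store yields digests instead of readable keys, which is why the original key is stored alongside the value.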