I am using shove to avoid loading a huge dictionary into memory.
import sys
import string
from shove import Shove

lemmaDict = Shove('file://storage')
with open(str(sys.argv[1])) as lemmaCPT:
    for line in lemmaCPT:
        line = line.rstrip('\n')
        # each line is: source phrase ||| target phrase ||| scores
        lineAr = string.split(line, ' ||| ')
        lineKey = lineAr[0] + ' ||| ' + lineAr[1]
        lineValue = lineAr[2]
        print lineValue
        lemmaDict[lineKey] = lineValue
However, I'm getting the following KeyError and Traceback partway through reading lemmaCPT. What's going on?
Traceback (most recent call last):
  File "./stemmer.py", line 19, in <module>
    lemmaDict[lineKey] = lineValue
  File "/opt/Python-2.7.6/lib/python2.7/site-packages/shove/core.py", line 44, in __setitem__
    self.sync()
  File "/opt/Python-2.7.6/lib/python2.7/site-packages/shove/core.py", line 74, in sync
    self._store.update(self._buffer)
  File "/opt/Python-2.7.6/lib/python2.7/_abcoll.py", line 542, in update
    self[key] = other[key]
  File "/opt/Python-2.7.6/lib/python2.7/site-packages/shove/base.py", line 123, in __setitem__
    raise KeyError(key)
KeyError: '! ! ! \xd1\x87\xd0\xb8\xd1\x82\xd0\xb0\xd0\xb5\xd1\x82\xd1\x81\xd1\x8f \xd1\x82\xd1\x80\xd0\xbe\xd0\xb5\xd0\xba\xd1\x80\xd0\xb0\xd1\x82\xd0\xbd\xd1\x8b\xd0\xbc \xd0\xbf\xd0\xbe\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5\xd0\xbc \xd0\xbb\xd1\x8e\xd0\xb1\xd0\xbe\xd0\xb3\xd0\xbe ||| ! ! ! is pronounced by'
Sample input:
! ! ! читается троекратным повторением ||| ! ! ! is pronounced by repeating ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| ! ! ! is pronounced by ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| ! ! ! is pronounced ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| ! ! ! is ||| 0.00819374 8.53148e-39 0.00989281 0.0128612
! ! ! читается троекратным повторением ||| ! ! ! ||| 0.000119622 8.53148e-39 0.0098932 0.590703
! ! ! читается троекратным повторением ||| , ! ! ! is pronounced by ||| 0.00819374 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| , ! ! ! is pronounced ||| 0.00819374 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| , ! ! ! is ||| 0.00819374 8.53148e-39 0.00989281 0.00154241
! ! ! читается троекратным повторением ||| , ! ! ! ||| 0.0074488 8.53148e-39 0.00989281 0.070842
! ! ! читается троекратным повторением любого ||| ! ! ! is pronounced by repeating ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением любого ||| ! ! ! is pronounced by ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
Running code.py sampleinput will yield the aforementioned KeyError and Traceback.
Well, if this is the actual input, then the problem is with the length of lemmaDict and the input.
...
aftnix@dev:~⟫ cat input | wc -l
11
My changed code:
from shove import Shove
import sys
import string

lemmaDict = Shove('file://storage')
i = 0
with open(str(sys.argv[1])) as lemmaCPT:
    for line in lemmaCPT:
        line = line.rstrip('\n')
        lineAr = string.split(line, ' ||| ')
        lineKey = lineAr[0] + ' ||| ' + lineAr[1]
        lineValue = lineAr[2]
        print lineValue
        print len(lemmaDict)
        #print len(lemmaCPT)
        i+=1
        print i
        #lemmaDict[lineKey] = lineValue
This gives the following output:
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
1
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
2
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
3
0.00819374 8.53148e-39 0.00989281 0.0128612
9
4
0.000119622 8.53148e-39 0.0098932 0.590703
9
5
0.00819374 8.53148e-39 0.00989281 8.53148e-39
9
6
0.00819374 8.53148e-39 0.00989281 8.53148e-39
9
7
0.00819374 8.53148e-39 0.00989281 0.00154241
9
8
0.0074488 8.53148e-39 0.00989281 0.070842
9
9
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
10
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
So you are simply overrunning the dict.
If you delete two lines from the input, it will stop throwing the exception.
I don't know much about shove, but a quick check in the shell tells me it always returns a line keyed dict. There has to be a way to grow it; maybe there is a method or something like it. You should dig into its docs more closely.
I just have a feeling you're using Shove in the wrong way.
EDIT: It's kind of bizarre. After reviewing the Shove code, it turns out it should have synced its in-memory content to the store when the buffer limit is reached:
def __setitem__(self, key, value):
    self._cache[key] = self._buffer[key] = value
    # when buffer reaches self._limit, write buffer to store
    if len(self._buffer) >= self._sync:
        self.sync()
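To make the buffering behaviour concrete, here is a toy sketch of a write-buffered mapping. It is not shove's implementation: the class name BufferedStore and the small sync threshold are made up for illustration, and a plain dict stands in for the real backend. The point is only that a failing write surfaces when the buffer is flushed, which may be several assignments after the offending key was set.

```python
class BufferedStore(object):
    """Toy write-buffered mapping; NOT shove's actual implementation."""

    def __init__(self, backend, sync=2):
        self._backend = backend   # dict-like object that does the real writes
        self._buffer = {}
        self._sync = sync         # flush once this many writes are pending

    def __setitem__(self, key, value):
        self._buffer[key] = value
        if len(self._buffer) >= self._sync:
            self.sync()

    def sync(self):
        # any error from the backend surfaces here, not at the original write
        self._backend.update(self._buffer)
        self._buffer.clear()


backend = {}
store = BufferedStore(backend, sync=2)
store['a'] = 1
pending = dict(backend)   # snapshot: still empty, 'a' is only buffered
store['b'] = 2            # second write reaches the threshold and flushes
```

That would be consistent with the traceback in the question, where the KeyError is raised inside sync() rather than directly in the assignment.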
EDIT 2
Well, I was totally wrong in my earlier point, but I've got an interesting pointer. One of the problems is that shove raised a confusing exception.
The real exception comes from this code in shove's base.py:
def __setitem__(self, key, value):
    # (per Larry Meyn)
    try:
        with open(self._key_to_file(key), 'wb') as item:
            item.write(self.dumps(value))
    except (IOError, OSError):
        raise KeyError(key)
So the exception actually came from the open call: any IOError or OSError raised while writing the file is re-raised as a KeyError. That means shove is having trouble writing files. My new suspicion is the length of the key string.
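One way to see the error that the KeyError hides is to reproduce the write outside shove. The sketch below is an assumption-laden stand-in, not shove code: the helper try_create is hypothetical, and it simply tries to create a file with a name of a given length in a temporary directory and reports the errno instead of swallowing it.

```python
import errno
import os
import shutil
import tempfile


def try_create(name_length):
    """Try to create a file whose name has name_length characters.

    Returns None on success, or the symbolic errno name on failure.
    """
    directory = tempfile.mkdtemp()
    try:
        path = os.path.join(directory, 'k' * name_length)
        with open(path, 'wb') as item:
            item.write(b'value')
        return None
    except (IOError, OSError) as exc:
        # shove's file store raises KeyError(key) at this point instead,
        # which is why the underlying cause is not visible in the traceback
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        shutil.rmtree(directory)


print(try_create(10))
print(try_create(300))
```

On a typical Linux filesystem I would expect the second call to report ENAMETOOLONG, but the exact error depends on the platform and filesystem.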
Here is what the storage folder looks like:
aftnix@dev:~⟫ ls -l storage/
total 36
-rw-rw-r-- 1 aftnix aftnix 49 ডিসে 4 01:35 %21+%21+%21+%D1%87%D0%B8%D1%82%D0%B0%D0%B5%D1%82%D1%81%D1%8F+%D1%82%D1%80%D0%BE%D0%B5%D0%BA%D1%80%D0%B0%D1%82%D0%BD%D1%8B%D0%BC+%D0%BF%D0%BE%D0%B2%D1%82%D0%BE%D1%80%D0%B5%D0%BD%D0%B8%D0%B5%D0%BC+%7C%7C%7C+%21+%21+%21
-rw-rw-r-- 1 aftnix aftnix 52 ডিসে 4 01:35 %21+%21+%21+%D1%87%D0%B8%D1%82%D0%B0%D0%B5%D1%82%D1%81%D1%8F+%D1%82%D1%80%D0%BE%D0%B5%D0%BA%D1%80%D0%B0%D1%82%D0%BD%D1%8B%D0%BC+%D0%BF%D0%BE%D0%B2%D1%82%D0%BE%D1%80%D0%B5%D0%BD%D0%B8%D0%B5%D0%BC+%7C%7C%7C+%2C+%21+%21+%21+is+pronounced
So shove is using the key as the file name. The key is percent-encoded first, so each Cyrillic letter takes six characters in the file name, which gets ugly quickly: your strings are very long in the last two entries, especially the penultimate one. As a test, I deleted some characters from the last two lines of the input, and the code ran as expected without any exception.
The Linux kernel has a limit on file name length:
aftnix@dev:~⟫ cat /usr/include/linux/limits.h
#ifndef _LINUX_LIMITS_H
#define _LINUX_LIMITS_H
#define NR_OPEN 1024
#define NGROUPS_MAX 65536 /* supplemental group IDs are available */
#define ARG_MAX 131072 /* # bytes of args + environ for exec() */
#define LINK_MAX 127 /* # links a file may have */
#define MAX_CANON 255 /* size of the canonical input queue */
#define MAX_INPUT 255 /* size of the type-ahead buffer */
#define NAME_MAX 255 /* # chars in a file name */
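A quick way to check a key against that limit is to encode it the same way and measure it. This is a sketch under an assumption: judging by the file names in the storage listing (spaces become +, ! becomes %21), the encoding looks like what quote_plus produces, so that is what the hypothetical helper encoded_name_length uses. The two keys are taken from the sample input.

```python
# -*- coding: utf-8 -*-
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2

NAME_MAX = 255  # from limits.h


def encoded_name_length(key):
    """Length of the file name, assuming the key is encoded with quote_plus."""
    return len(quote_plus(key))


short_key = '! ! ! читается троекратным повторением ||| ! ! !'
long_key = ('! ! ! читается троекратным повторением любого'
            ' ||| ! ! ! is pronounced by')

for key in (short_key, long_key):
    n = encoded_name_length(key)
    print('%d %s' % (n, 'ok' if n <= NAME_MAX else 'too long'))
```

The first key stays under the limit, while the key from the KeyError comes out longer than 255 characters once encoded.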
So to get around it, you have to do something else: you can't put the raw parsed key into Shove.
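One possible workaround, sketched here as a suggestion rather than anything shove provides, is to store entries under a fixed-length digest of the key. The helper short_key is hypothetical, and a plain dict stands in for Shove('file://storage') so the example runs on its own; I have not tested this against shove itself.

```python
# -*- coding: utf-8 -*-
import hashlib


def short_key(key):
    """Map an arbitrarily long key to a 40-character hex digest."""
    if not isinstance(key, bytes):
        key = key.encode('utf-8')
    return hashlib.sha1(key).hexdigest()


store = {}  # stand-in for Shove('file://storage'); any dict-like store works

line = ('! ! ! читается троекратным повторением ||| ! ! ! is'
        ' ||| 0.00819374 8.53148e-39 0.00989281 0.0128612')
source, target, scores = line.split(' ||| ')
line_key = source + ' ||| ' + target

# keep the original key next to the value, since the digest is one-way
store[short_key(line_key)] = (line_key, scores)
```

Lookups then go through short_key(lineKey) as well. The trade-off is that you can no longer list the original keys from the file names, which is why the sketch keeps the original key in the stored value.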