My Python code below, which builds a defaultdict, is printing output like:
defaultdict(<type 'list'>, {'[0, 13, 26, 39]': ['1']})
defaultdict(<type 'list'>, {'[0, 13, 26, 39]': ['1']})
defaultdict(<type 'list'>, {'[6, 19, 32, 45]': ['1']})
defaultdict(<type 'list'>, {'[3, 16, 29, 42]': ['1']})
How is it possible to get duplicate keys in the above output?
Shouldn't it be like:
defaultdict(<type 'list'>, {'[0, 13, 26, 39]': ['1', '1']})
defaultdict(<type 'list'>, {'[6, 19, 32, 45]': ['1']})
defaultdict(<type 'list'>, {'[3, 16, 29, 42]': ['1']})
The code I am running is
from collections import defaultdict

def make_bands(value):
    d2 = defaultdict(list)
    for key, val in value.iteritems():
        d2[str(list(val[0:4]))].append("1")
    print d2
Here value is another dictionary, and make_bands is called to process a Spark RDD as follows:
signatureBands = signatureTable.map(lambda x: make_bands(x)).collect()
First, no, you cannot expect the output to be what you want it to be. d2 is not kept between calls: it is created anew every time the function is entered, so every call prints its own fresh single-entry dictionary, and the same key can appear once per call. You can still get what you want if you keep the state in a class, in a generator (less elegant here), or in a function that constructs a function instead of a lambda (that would be my choice here):
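To see the effect in isolation, here is a minimal Python 3 sketch (items() instead of iteritems(), and hypothetical row_a/row_b inputs) of the original function. Each call rebuilds d2 from scratch, so the same band key shows up once per call instead of accumulating:

```python
from collections import defaultdict

def make_bands(value):
    # d2 is a fresh, empty defaultdict on every call --
    # nothing survives from one call to the next.
    d2 = defaultdict(list)
    for key, val in value.items():
        d2[str(list(val[0:4]))].append("1")
    return d2

# Two hypothetical rows whose first four signature values are identical:
row_a = {"sig": [0, 13, 26, 39, 52]}
row_b = {"sig": [0, 13, 26, 39, 65]}

print(make_bands(row_a))  # defaultdict(<class 'list'>, {'[0, 13, 26, 39]': ['1']})
print(make_bands(row_b))  # defaultdict(<class 'list'>, {'[0, 13, 26, 39]': ['1']})
```

This is exactly the duplicated-key output from the question: two separate dictionaries, each holding the key once.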
def build_make_bands():
    d2 = defaultdict(list)
    def make_bands(value):
        for key, val in value.iteritems():
            d2[str(list(val[0:4]))].append("1")
        print d2
    return make_bands
And then you'd call it like this:
signatureTable.map(build_make_bands()).collect()
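In plain Python the closure behaves as intended, because d2 lives in build_make_bands's scope and persists across calls to the inner function. A Python 3 sketch with the same hypothetical rows as above (note that on a real Spark cluster the closure is serialized out to executors, so state accumulated in d2 on workers is not automatically visible back on the driver):

```python
from collections import defaultdict

def build_make_bands():
    # d2 is captured by the closure, so it survives
    # across calls to the inner function.
    d2 = defaultdict(list)
    def make_bands(value):
        for key, val in value.items():
            d2[str(list(val[0:4]))].append("1")
        return d2
    return make_bands

make_bands = build_make_bands()
make_bands({"sig": [0, 13, 26, 39, 52]})
result = make_bands({"sig": [0, 13, 26, 39, 65]})
print(result)  # defaultdict(<class 'list'>, {'[0, 13, 26, 39]': ['1', '1']})
```

Both calls write into the same dictionary, so the shared key now accumulates two entries, which is the output the question was expecting.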