Tags: python, python-2.7, pyspark, defaultdict

Python default dictionary seems to be giving duplicate key - what is happening?


My code below, which returns data as a defaultdict in Python, is giving output like:

defaultdict(<type 'list'>, {'[0, 13, 26, 39]': ['1']})                          
defaultdict(<type 'list'>, {'[0, 13, 26, 39]': ['1']})
defaultdict(<type 'list'>, {'[6, 19, 32, 45]': ['1']})
defaultdict(<type 'list'>, {'[3, 16, 29, 42]': ['1']})

How is it possible to get duplicate keys in the above output?

Shouldn't it be like:

defaultdict(<type 'list'>, {'[0, 13, 26, 39]': ['1', '1']})                          
defaultdict(<type 'list'>, {'[6, 19, 32, 45]': ['1']})
defaultdict(<type 'list'>, {'[3, 16, 29, 42]': ['1']})

The code I am running is:

from collections import defaultdict

def make_bands(value):
    d2 = defaultdict(list)
    for key, val in value.iteritems():
        d2[str(list(val[0:4]))].append("1")

    print d2

Here, value is another dictionary.

The function make_bands is called to process a Spark RDD as follows:

signatureBands = signatureTable.map(lambda x: make_bands(x)).collect()
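The key point is that map applies make_bands once per RDD element, and each call builds its own d2, so every element produces a separate one-key dictionary. A minimal plain-Python sketch (no Spark, Python 3 syntax, made-up input rows) that simulates what the map call does:

```python
from collections import defaultdict

def make_bands(value):
    # A fresh defaultdict is created on every call, so state
    # never accumulates across elements.
    d2 = defaultdict(list)
    for key, val in value.items():
        d2[str(list(val[0:4]))].append("1")
    return d2

# Hypothetical stand-in for the RDD: two rows fall into the same band.
rows = [
    {"r0": [0, 13, 26, 39, 7]},
    {"r1": [0, 13, 26, 39, 8]},
]

# A list comprehension calls make_bands once per element,
# just as rdd.map(make_bands) does.
results = [make_bands(row) for row in rows]
for d in results:
    print(d)
# Both dictionaries contain the same key '[0, 13, 26, 39]', but they
# are two separate dict objects -- hence the "duplicate" keys in the
# printed output.
```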

Solution

  • First, no, you cannot expect the output to be what you want it to be. d2 is not kept between calls; it is created anew every time the function is entered. You can still get what you want if you keep the state elsewhere: in a class, in a generator (less elegant here), or in a function that constructs a function instead of a lambda (that would be my choice here):

    from collections import defaultdict

    def build_make_bands():
        d2 = defaultdict(list)
        def make_bands(value):
            for key, val in value.iteritems():
                d2[str(list(val[0:4]))].append("1")
            print d2
        return make_bands
    

    And then you'd call it like this:

    signatureBands = signatureTable.map(build_make_bands()).collect()
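In plain Python the closure returned by build_make_bands keeps one shared d2, so repeated calls accumulate into the same dictionary. A sketch (Python 3 syntax, same made-up rows as above):

```python
from collections import defaultdict

def build_make_bands():
    d2 = defaultdict(list)  # created once, shared by every call below
    def make_bands(value):
        for key, val in value.items():
            d2[str(list(val[0:4]))].append("1")
        return d2
    return make_bands

make_bands = build_make_bands()
make_bands({"r0": [0, 13, 26, 39, 7]})
make_bands({"r1": [0, 13, 26, 39, 8]})
d2 = make_bands({"r2": [6, 19, 32, 45, 9]})
print(d2)
# '[0, 13, 26, 39]' now maps to ['1', '1'] -- both calls landed in
# the same dictionary -- and '[6, 19, 32, 45]' maps to ['1'].
```

One caveat: in a real Spark job the closure is serialized out to each task, so this shared state would not accumulate across partitions on the driver the way it does in a single Python process.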