Tags: python, apache-spark, dictionary, pyspark, rdd

How do I load a dict directly into an RDD?


I have a dictionary in Python:

{'609232972': 4, '975151075': 4, '14247572': 4, '2987788788': 4, '3064695250': 2}

How can I load this directly into an RDD without losing the key-value pairs?

When I load it like this (where partition is the dictionary above):

usr_group = sc.parallelize(partition)
print(usr_group.take(5))

It just breaks up the key-value pairs and returns only the keys:

['609232972', '975151075', '14247572', '2987788788', '3064695250']

I am expecting the RDD to keep the keys and values together, as in

{'609232972': 4, '975151075': 4, '14247572': 4, '2987788788': 4, '3064695250': 2}

so that I can process each key-value pair together.


Solution

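  • It depends on what you want each row of the RDD to be. Here are three options: the whole dict as a single row (rdd1), one (key, value) tuple per row (rdd2), or one single-entry dict per row (rdd3):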

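    # sc is assumed to be an existing SparkContext (for example, from the pyspark shell)
    my_dict = {'609232972': 4, '975151075': 4, '14247572': 4, '2987788788': 4, '3064695250': 2}
    rdd1 = sc.parallelize([my_dict])               # one row: the whole dict
    rdd2 = sc.parallelize(list(my_dict.items()))   # one (key, value) tuple per row
    rdd3 = rdd2.map(lambda kv: dict([kv]))         # one single-entry dict per row
    # Row order follows the dict's iteration order, which depends on the Python version
    print(rdd1.collect())
    print(rdd2.take(4))
    print(rdd3.take(4))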
    

    [{'2987788788': 4, '975151075': 4, '3064695250': 2, '14247572': 4, '609232972': 4}]

    [('2987788788', 4), ('975151075', 4), ('3064695250', 2), ('14247572', 4)]

    [{'2987788788': 4}, {'975151075': 4}, {'3064695250': 2}, {'14247572': 4}]
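
The reason the original attempt returned only keys can be shown without Spark. As far as I know, sc.parallelize treats its argument as a plain iterable, and iterating over a Python dict yields only its keys. The sketch below uses plain Python lists as stand-ins for what each option hands to parallelize. The names keys_only, pairs, and whole are illustrative and not from the original post.

```python
# Plain Python, no Spark required: shows what iterating a dict produces.
my_dict = {'609232972': 4, '975151075': 4, '14247572': 4,
           '2987788788': 4, '3064695250': 2}

# Iterating a dict directly yields only its keys. This mirrors what
# passing the dict itself to sc.parallelize ends up distributing.
keys_only = list(my_dict)

# Iterating .items() yields (key, value) tuples, so each pair stays together.
pairs = list(my_dict.items())

# Wrapping the dict in a list gives a one-element sequence: the whole dict.
whole = [my_dict]

print(keys_only)
print(pairs)
print(len(whole))
```

If the goal is to process each key together with its value, the list of (key, value) tuples is usually the most convenient shape, since it matches the rdd2 option above.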