python apache-spark pyspark rdd

Removing the trailing comma from tuples and converting to int


I am struggling to convert the following Spark RDD data using the code below.

[('4', ('1', '2')),
('10', ('5',)),
('3', ('2',)),
('6', ('2', '5')),
('7', ('2', '5')),
('1', None),
('8', ('2', '5')),
('9', ('2', '5')),
('2', ('3',)),
('5', ('4', '2', '6')),
('11', ('5',))]

def adjDang(line, tc):
    node, edges = line
    print(f'node {node} edges {edges}')
    if edges == None:
        return (int(node),(0,0))
    else:
        if len(edges) == 1:
            newedges = (edges[0]) #remove the comma which is unnecessary check '11'
        else:
            newedges = ()
            for i in range(len(edges)):
                newedges += edges[i]

        print(f'node {node} edge{newedges}')

        return(int(node), (1/tc, newedges))

I am getting the following output:

[(4, (0.09090909090909091, ('1', '2'))),
 (10, (0.09090909090909091, '5')),
 (3, (0.09090909090909091, '2')),
 (6, (0.09090909090909091, ('2', '5'))),
 (7, (0.09090909090909091, ('2', '5'))),
 (1, (0, 0)),
 (8, (0.09090909090909091, ('2', '5'))),
 (9, (0.09090909090909091, ('2', '5'))),
 (2, (0.09090909090909091, '3')),
 (5, (0.09090909090909091, ('4', '2', '6'))),
 (11, (0.09090909090909091, '5'))]

The expected output has the format (node_id, (score, edges)). For example, node 5 should look like (5, (0.09090909090909091, 4, 2, 6)). The extra brackets should go away so that there is one single tuple after the node, and the edges should be integers.

I would appreciate any pointers on how to achieve this.


Solution

  • If you're using Python 3.5 or above, just change the return statement to

    return(int(node), (1 / tc, *newedges))
    

    (Same as what you have, but with a * that unpacks newedges into the surrounding tuple.)
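
Note that the * unpacking on its own flattens the tuple but leaves the edges as strings. Also, for single-edge nodes newedges is a plain string, so a multi-digit edge such as '10' would be unpacked character by character. A minimal sketch that also converts the edges to integers, assuming the rows look like the sample in the question (the name adj_dang and the small sample list are illustrative and not from the original post):

```python
def adj_dang(line, tc):
    """Turn (node, edges) into (node, (score, edge1, edge2, ...)) with int edges."""
    node, edges = line
    if edges is None:
        # Dangling node: keep the (0, 0) placeholder used in the question.
        return (int(node), (0, 0))
    # Convert every edge to int, then unpack them next to the score
    # so that a single flat tuple follows the node id.
    return (int(node), (1 / tc, *(int(e) for e in edges)))


# Plain-Python stand-in for a few rows of the RDD (illustrative data only).
sample = [('5', ('4', '2', '6')), ('11', ('5',)), ('1', None)]
result = [adj_dang(row, 11) for row in sample]
print(result)
```

On the RDD itself this could be applied with something like rdd.map(lambda line: adj_dang(line, tc)), where rdd is the RDD from the question and tc is the value already being passed to adjDang.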