Tags: python, performance, list, dictionary, graph-tool

Iteratively count elements in list and store count in dictionary


I have a piece of code that loops through a set of nodes and computes the path length connecting the given node to every other node in my network. For each node my code returns a list, b, containing integer values giving the path length for every possible connection. I want to count the number of occurrences of each path length so I can create a histogram.

local_path_length_hist = {}
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    #Delete some erroneous entries
    b = a[a!=2147483647]
    for dist in b:
        if dist in local_path_length_hist:
            local_path_length_hist[dist]+=1
        else:
            local_path_length_hist[dist]=1

This is presumably very crude as far as the dictionary update is concerned. Is there a better way of doing it? What is the most efficient way of building this histogram?


Solution

  • Since gt.shortest_distance returns an ndarray, numpy math is fastest:

    max_dist = len(vertices) - 1
    hist_length = max_dist + 2
    no_path_dist = max_dist + 1   # last bin collects the "no path" sentinel values
    hist = np.zeros(hist_length)
    for ver in vertices:
        dist = gt.shortest_distance(g, source=g.vertex(ver))
        # minlength keeps bincount's output the same length on every iteration,
        # even for a source vertex that can reach every other vertex
        hist += np.bincount(dist.a.clip(max=no_path_dist), minlength=hist_length)
    

    I use the ndarray method clip to bin the 2147483647 values returned by gt.shortest_distance at the last position of hist. Without use of clip, hist's size would have to be 2147483647 + 1 on 64-bit Python, or bincount would produce a ValueError on 32-bit Python. So the last position of hist will contain a count of all non-paths; you can ignore this value in your histogram analysis.
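
    To make the clipping step concrete, here is a tiny standalone sketch (the toy numbers are made up, not taken from the question's graph) showing how clip plus bincount puts every unreachable vertex into the last bin:

    import numpy as np

    # Toy distances for a 5-vertex graph: valid lengths are 0..4,
    # and 2147483647 marks an unreachable vertex.
    dist_a = np.array([0, 1, 1, 3, 2147483647])

    max_dist = 4                  # len(vertices) - 1
    no_path_dist = max_dist + 1   # bin index reserved for "no path"
    hist_length = max_dist + 2

    counts = np.bincount(dist_a.clip(max=no_path_dist), minlength=hist_length)
    print(counts)   # [1 2 0 1 0 1] -> one 0, two 1s, one 3, and one "no path" in the last bin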


    As the timings below (in seconds) indicate, using numpy math to obtain a histogram is well over an order of magnitude faster than using either defaultdicts or counters (Python 3.4):

    # vertices      numpy    defaultdict    counter
        9000       0.83639    38.48990     33.56569
       25000       8.57003    314.24265    262.76025
       50000      26.46427   1303.50843   1111.93898
    

    My computer is too slow to test with 9 * (10**6) vertices, but the relative timings seem pretty consistent for varying numbers of vertices (as we would expect).
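
    For reference, this is roughly how the pure-Python alternatives from the table above look when applied to the question's loop. This is only a sketch, and it assumes the same g, vertices and gt objects from the question:

    from collections import Counter, defaultdict

    # Counter variant: feed the whole filtered array to update() once per source vertex
    local_path_length_hist = Counter()
    for ver in vertices:
        dist = gt.shortest_distance(g, source=g.vertex(ver))
        b = dist.a[dist.a != 2147483647]           # drop the "no path" sentinel
        local_path_length_hist.update(b.tolist())  # tolist() gives plain-int keys

    # defaultdict variant: no explicit "if key in dict" test needed
    local_path_length_hist = defaultdict(int)
    for ver in vertices:
        dist = gt.shortest_distance(g, source=g.vertex(ver))
        for d in dist.a[dist.a != 2147483647]:
            local_path_length_hist[int(d)] += 1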


    Timing code:

    from collections import defaultdict, Counter
    import numpy as np
    from random import randint, choice
    from timeit import repeat
    
    # construct distance ndarray such that:
    # a) 1/3 of values represent no path
    # b) 2/3 of values are a random integer value [0, (num_vertices - 1)]
    num_vertices = 50000
    no_path_length = 2147483647
    distances = []
    for _ in range(num_vertices):
        rand_dist = randint(0, num_vertices - 1)
        distances.append(choice((no_path_length, rand_dist, rand_dist)))
    dist_a = np.array(distances)
    
    def use_numpy_math():
        max_dist = num_vertices - 1
        hist_length = max_dist + 2
        no_path_dist = max_dist + 1
        hist = np.zeros(hist_length, dtype=np.int64)
        for _ in range(num_vertices):
            hist += np.bincount(dist_a.clip(max=no_path_dist), minlength=hist_length)
    
    def use_default_dict():
        d = defaultdict(int)
        for _ in range(num_vertices):
            for dist in dist_a:
                d[dist] += 1
    
    def use_counter():
        hist = Counter()
        for _ in range(num_vertices):
            hist.update(dist_a)
    
    t1 = min(repeat(stmt='use_numpy_math()', setup='from __main__ import use_numpy_math',
                    repeat=3, number=1))
    t2 = min(repeat(stmt='use_default_dict()', setup='from __main__ import use_default_dict',
                    repeat=3, number=1))
    t3 = min(repeat(stmt='use_counter()', setup='from __main__ import use_counter',
                    repeat=3, number=1))
    
    print('%0.5f, %0.5f, %0.5f' % (t1, t2, t3))
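
    If you want to get back to the {path_length: count} form from the question, or to draw the histogram, the accumulated hist array from the first code block converts directly. A minimal sketch, ignoring the final "no path" bin and assuming matplotlib is available for the plot:

    # Dictionary form, dropping the last bin (it only counts unreachable pairs)
    local_path_length_hist = {length: int(count)
                              for length, count in enumerate(hist[:-1])
                              if count > 0}

    # Or plot the counts directly
    import matplotlib.pyplot as plt
    plt.bar(range(len(hist) - 1), hist[:-1])
    plt.xlabel('path length')
    plt.ylabel('count')
    plt.show()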