Search code examples
pythondictionarycounter

python: Dictionary counter items. How to optimize performance


I need to keep track of occurrences of items in parsed files in a structure as follows:

data= {'tomate': {'John': 2, 'mike': 2},
 'pasta': {'mike': 1},
 'wine': {'mike': 1, 'alex': 2}}

the dictionary starts as empty one. data = {} and its is being populated with values when they are not there and adding up users +1 if the user is there or =1 if the user is not yet there.

This is my code:

def modify(data, key, client):
try:
    # if both keys are there
    data[key][client] = data[key][client] +1
    #print(f"{key} is there. {client} is there.")
except:
    try: 
        data[key][client] = 1
        #print(f"{client} not there, but {key}")
    except:
        data[key]={}
        data[key][client] = 1
        #print(f"{client} and {key} added")
return data

It works:

key="wine"
client = "alex"
modify(d, key, client)

giving:

{'tomate': {'John': 2, 'mike': 2},
 'pasta': {'mike': 1},
 'wine': {'mike': 1, 'alex': 3}}

The question is if using try/except is not the way to go, i.e. not pythonic, because I am relying on exceptions for the code to work properly which I feel is a little bit weird and might make the whole code much slower.

Is there any other option to keep track of a counter like this that I might have a look to that might be faster? Performance is very important of course since I have to count hundreds of millions of times.


EDIT 1: Evaluating the speed of a proposed solution I have this comparison that strikes me since my solution is faster (and I would not expect that):

enter image description here

EDIT 2: This question has been closed despite of very well elaborated answers and being directed to finding a proper way to solve a problem. It is really annoying that people just close questions like this. How can someone say that this questions is opinion-based if we are talking about computing times?


Solution

  • With try/except version, the search for index is done only once (most of the time). It serves both the purpose of checking the presence of index, and altering the associated value. That is why your version is the fastest so far. Last one-liner from Andrej is very fast also, especially with lot of new indexes (because then, your version spend some time in except, whereas Andrej's one-liner doesn't really care)

    One notable, seemingly minor but important improvement to your version could be:

    def modify(data, key, client):
    try:
        # if both keys are there
        data[key][client] += 1
        #print(f"{key} is there. {client} is there.")
    except:
        try: 
            data[key][client] = 1
            #print(f"{client} not there, but {key}")
        except:
            data[key]={}
            data[key][client] = 1
            #print(f"{client} and {key} added")
    return data
    

    Yes, just that += changes quite a lot. Especially with lot of "updates" (I mean, not new index). Because x=x+1 means that python needs to figure out what is x twice. Whereas x+=1 finds the l-value only once. Which is roughly the same usually. Except when finding the place itself is a non-negligible part of the cost. As it is here, with indexation

    Some benchmarking on big sets of generated data Experiment 1: data with lot of new indexes.

    Version time
    Yours 5.11
    Mine (that is yours with +=) 4.44
    Andrej's 2-liner 6.21
    Andrej/Yevhen's if-else 5.69
    Get based 5.99
    Andrej's one-liner 5.08

    Experiment 2, with lots of collisions (that is not new index. That is other than 1 values...)

    Version time
    Yours 5.43
    My += 4.67
    Andrej's 2-liner 6.73
    if/else 6.23
    .get 6.51
    Andrej's 1-liner 5.60

    So, in both cases, your version, upgraded with my += suggestion, is the fastest. If lot of new indexes, then Andrej's one-liner slightly beats your version (without +=). If lot of existing indexes, it is roughly the same (again, because then you use more except clauses for new indexes).

    += improvement make your version the best in any cases.