python, python-3.x, pandas, nested-loops

Python: how to get rid of nested loops?


I have two nested for loops and I want to get rid of them to improve the speed of my code. My pandas dataframe looks like this (the column headers are companies, the rows are users, and 1 means that the user accessed that company, 0 otherwise):

   100  200  300  400
0    1    1    0    1
1    1    1    1    0

I want to compare each pair of companies in my dataset, and for that I created a list that contains the ids of all the companies. The code takes the first company from the list (base) and pairs it with every other company (peer), hence the second "for" loop. My code is the following:

def calculate_scores():
    df_matrix = create_the_matrix(df)
    print(df_matrix)
    for base in list_of_companies:
        counter = 0
        for peer in list_of_companies:
            counter += 1
            if base == peer:
                # A firm is never compared with itself
                continue

            # Calculate the denominator first, since it slices the big matrix
            # down to the users that have accessed the base firm
            denominator_df = df_matrix.loc[df_matrix[base] == 1]
            denominator = denominator_df.sum(axis=1).values.tolist()
            denominator = sum(denominator) - len(denominator)

            # Calculate the numerator. This is done afterwards because it
            # slices the dataframe above further, keeping only the users
            # that have accessed both the base and the peer firm
            numerator_df = denominator_df.loc[(denominator_df[base] == 1) & (denominator_df[peer] == 1)]
            numerator = len(numerator_df.index)
            annual_search_fraction = numerator / denominator
            print("Base: {} and Peer: {} ==> {}".format(base, peer, annual_search_fraction))

EDIT 1 (added code explanation):

The metric is the following:

[Image: formula for the metric, not reproduced here]

1) The metric I am trying to calculate tells me how often two companies are searched together relative to all the other searches.

2) The code first selects all the users who have accessed the base firm (the line denominator_df = df_matrix.loc[(df_matrix[base] == 1)]). It then calculates the denominator, which counts the unique combinations between the base firm and any other firm searched by the same user. Since I can count the number of firms each user accessed, I subtract 1 per user to get the number of unique links between the base firm and the other firms.

3) Next, the code filters denominator_df to keep only the rows (users) that accessed both the base and the peer firm. Since I need the number of users who accessed both firms, I use numerator = len(numerator_df.index) to count those rows, which gives me the numerator.

The expected output from the dataframe at the top is the following:

Base: 100 and Peer: 200 ==> 0.5
Base: 100 and Peer: 300 ==> 0.25
Base: 100 and Peer: 400 ==> 0.25
Base: 200 and Peer: 100 ==> 0.5
Base: 200 and Peer: 300 ==> 0.25
Base: 200 and Peer: 400 ==> 0.25
Base: 300 and Peer: 100 ==> 0.5
Base: 300 and Peer: 200 ==> 0.5
Base: 300 and Peer: 400 ==> 0.0
Base: 400 and Peer: 100 ==> 0.5
Base: 400 and Peer: 200 ==> 0.5
Base: 400 and Peer: 300 ==> 0.0

4) A sanity check that the code gives the correct result: the metrics between one base firm and all of its peer firms have to sum to 1, and they do in the code I posted.
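
Written out from points 2) and 3), the metric can be sketched as below. This is a reconstruction from the description and the code rather than the original formula; the symbols $x_{u,c}$ (1 if user $u$ accessed company $c$, 0 otherwise), $b$ (base) and $p$ (peer) are introduced here only for illustration.

```latex
S(b, p) = \frac{\sum_{u} x_{u,b}\, x_{u,p}}
               {\sum_{u} x_{u,b} \left( \sum_{c} x_{u,c} - 1 \right)},
\qquad p \neq b
```

For base 100 and peer 300 in the dataframe above, the numerator is 1 (only user 1 accessed both) and the denominator is (3 - 1) + (3 - 1) = 4, which gives 0.25. Summing the numerator over all peers other than the base gives exactly the denominator, which is why the scores for one base firm add up to 1.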

Any suggestions or tips on which direction to go will be appreciated!


Solution

  • You might be looking for itertools.product(). Here is an example that is similar to what you seem to want to do:

    import itertools
    
    a = [ 'one', 'two', 'three' ]
    
    for b in itertools.product( a, a ):
        print( b )
    

    The output from the above code snippet is:

    ('one', 'one')
    ('one', 'two')
    ('one', 'three')
    ('two', 'one')
    ('two', 'two')
    ('two', 'three')
    ('three', 'one')
    ('three', 'two')
    ('three', 'three')
    

    Or you could do this:

    for u,v in itertools.product( a, a ):
        print( "%s %s"%(u, v) )
    

    The output is then,

    one one
    one two
    one three
    two one
    two two
    two three
    three one
    three two
    three three
    

    If you would like a list, you could do this:

    alist = list( itertools.product( a, a ) )
    
    print( alist )
    

    And the output is,

    [('one', 'one'), ('one', 'two'), ('one', 'three'), ('two', 'one'), ('two', 'two'), ('two', 'three'), ('three', 'one'), ('three', 'two'), ('three', 'three')]
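
Applied to the question, a minimal sketch could look like the following. itertools.permutations(..., 2) yields every ordered pair without the base == peer case, so the explicit check disappears; it still visits the same number of pairs, so on its own it is unlikely to be much faster. The second function is an alternative that is not part of the answer above: it computes the scores for all pairs at once with a matrix product. The function names and the hard-coded df_matrix (copied from the question's example) are assumptions for illustration.

```python
import itertools

import pandas as pd

# Example frame from the question: columns are company ids, rows are users.
df_matrix = pd.DataFrame(
    {100: [1, 1], 200: [1, 1], 300: [0, 1], 400: [1, 0]}
)


def scores_with_permutations(matrix):
    """Score every ordered (base, peer) pair in a single flat loop."""
    # Number of *other* firms each user accessed, computed once up front.
    links_per_user = matrix.sum(axis=1) - 1
    scores = {}
    # permutations(..., 2) never yields a pair with base == peer.
    for base, peer in itertools.permutations(matrix.columns, 2):
        accessed_base = matrix[base] == 1
        denominator = links_per_user[accessed_base].sum()
        numerator = (accessed_base & (matrix[peer] == 1)).sum()
        scores[(base, peer)] = numerator / denominator
    return scores


def scores_vectorized(matrix):
    """Same scores as a base-by-peer table, without a Python-level loop."""
    # co_access.loc[base, peer] = number of users who accessed both firms.
    co_access = matrix.T @ matrix
    # Per base firm: links to other firms, summed over users who accessed it.
    denominators = matrix.T @ (matrix.sum(axis=1) - 1)
    # Divide each base row by its denominator. The diagonal (base == peer)
    # is not part of the metric and should be ignored.
    return co_access.div(denominators, axis=0)


for (base, peer), score in scores_with_permutations(df_matrix).items():
    print("Base: {} and Peer: {} ==> {}".format(base, peer, score))
```

On the question's example frame both functions agree with the expected output listed there, for instance 0.5 for base 100 and peer 200. The vectorized version assumes the matrix contains only the 0/1 company columns.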