python multithreading algorithm combinations processing-efficiency

Efficient combination of dataframe rows with itself


I have a pandas DataFrame with rows "a, b, c, d, ... z", and I want to get all possible combinations: "aa, ab, ac, ad, ... az", then "ba, bb, bc, bd, ... bz", and so on.

What I have done so far is a simple nested for loop:

for index, d1 in d.iterrows():
    for index2, d2 in d.iterrows():
        #do stuff

The code above works fine. However, the dataframe is very big (50000 rows) and I am trying to be as efficient as possible (which I clearly am not right now). With these for loops I also get both "ab" and "ba", which are the same thing for my purposes. Let's say that, in

ab, ac, ad, ba, bc, bd, ca, cb, cd, da, db, dc

the combinations

ab-ba, ac-ca, ad-da, bc-cb, bd-db, cd-dc

are the same.

So, for the above reason:

FIRST: I am thinking of iterating over only the first half. Right now, each of the 50000 rows is combined with all 50000 rows. To cut down on calculations, I would combine the first 25000 rows with all 50000 rows of the table. That still does not avoid every unnecessary combination, but would it make sense, and would it still return every combination in less time? Is there an existing algorithm that I could study?

SECOND: I tried to implement multiprocessing (I do have a good multicore/multithread processor) because no combination relies on a previous calculation, so it seems like a good way to increase performance. However, I was unsuccessful in doing so. What would you suggest? Which library or method?

What else could I do to be more efficient and increase performance?

(Just for the curious, I have a project to make some unique lettered phrases which means I will run the above algorithm several times and I will need all the performance I can get)


Solution

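  • I think what you are looking for is combinations from itertools, a module in the standard library. It yields each unordered pair exactly once, so "ab" is produced but "ba" is not, and no element is paired with itself.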

    from itertools import combinations
    
    for d1, d2 in combinations(df['column name'], 2):
        # do stuff
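
To make the difference concrete, here is a minimal sketch on a four-element list. The list `rows` is only a stand-in for `df['column name']` and is an assumption for illustration; `combinations` accepts any iterable, so a pandas Series should work the same way. If the self-pairs ("aa", "bb", ...) from the question are also needed, `combinations_with_replacement` is one option.

```python
from itertools import combinations, combinations_with_replacement
from math import comb

# Hypothetical stand-in for df['column name'].
rows = ["a", "b", "c", "d"]

# Each unordered pair once: "ab" appears, "ba" does not, and no "aa".
pairs = list(combinations(rows, 2))

# Same idea, but self-pairs such as ("a", "a") are included.
pairs_with_self = list(combinations_with_replacement(rows, 2))

# Rough size of the work for the 50000-row case from the question.
n = 50_000
unordered_pairs = comb(n, 2)   # n * (n - 1) / 2
nested_loop_pairs = n * n      # what the double iterrows() loop visits

print(pairs)
print(len(pairs_with_self))
print(unordered_pairs, nested_loop_pairs)
```

For 50000 rows this comes to 1,249,975,000 pairs instead of the 2,500,000,000 visited by the nested loop, so roughly half the work before any other optimisation.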