Tags: pandas, python-3.7

How can I optimize my for loop so that it can run on a 320,000-row DataFrame?


I think I have a problem with computation time.

I want to run this code on a DataFrame of 320,000 rows and 6 columns:

index_data = data["clubid"].index.tolist()

for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1

The objective is to determine, for a given match, the number of wins and the number of losses accumulated over the previous matches, and this for every clubid.

The problem is that I don't get an error, but I never obtain any result either. When I tried with a smaller DataFrame (data[0:1000]) I got a result in 13 seconds. This is why I think it's a computation-time problem.

I also tried first using a groupby("clubid") and then running my for loop inside every group, but I got lost.

Something else that bothers me: I have at least two rows with exactly the same date/hour, because one match produces at least two identical dates. Because of this I can't use the date as the index.

Could you help me with these issues, please?


Solution

  • As I pointed out in the comment above, I think you can simply sum the win_bool vector by group. If the dates are sorted, this should be equivalent to your loop, correct?

    import pandas as pd

    # toy example: two clubs, integer dates, some duplicate dates
    dat = pd.DataFrame({
        "win_bool": [0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
        "clubid":   [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
        "date":     [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
        "othercol": ["a","b","b","b","b","b","b","b","b","b","b","b","b","b","b"],
    })

    temp = dat[["clubid", "win_bool"]].groupby("clubid")
    NW_tot = temp.sum()                               # wins per club
    NL_tot = temp.count()                             # matches per club
    NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]  # losses = matches - wins
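    If what you actually need is a running count per match (which is what the nested loop computes) rather than a single total per club, a sorted groupby with cumsum avoids the O(n²) loop. Here is a sketch using the same toy data; note that ties on date are counted in sort order here, whereas the loop counts every row with an equal-or-earlier date, so results can differ when dates repeat:

```python
import pandas as pd

# same toy data as above
dat = pd.DataFrame({
    "win_bool": [0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
    "clubid":   [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
    "date":     [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
})

# sort so "previous matches" means earlier dates within each club
dat = dat.sort_values(["clubid", "date"], kind="stable")
grp = dat.groupby("clubid")["win_bool"]
dat["NW_tot"] = grp.cumsum()                        # wins up to and including this match
dat["NL_tot"] = grp.cumcount() + 1 - dat["NW_tot"]  # losses up to and including this match
```

    The last line uses the fact that, within a club, matches-so-far minus wins-so-far equals losses-so-far.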
    

    If you have duplicate dates that inflate the counts, you could first drop duplicates by date (within groups):

    # drop duplicate dates
    temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
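    The snippet above only builds the grouped object; completing it with the same aggregation as before gives the de-duplicated counts. A self-contained sketch on the toy data:

```python
import pandas as pd

dat = pd.DataFrame({
    "win_bool": [0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
    "clubid":   [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
    "date":     [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
})

# keep only the first row for each (clubid, date) pair, then aggregate as before
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()                               # wins per club, duplicates removed
NL_tot = temp.count()["win_bool"] - NW_tot["win_bool"]  # losses per club
```

    Note that drop_duplicates keeps the first of each duplicated pair, so if a win row and a loss row share the same date, which one survives depends on row order.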