Tags: python, pandas, optimization, modin

How to optimize this pandas iterable


I have the following method in which I am eliminating overlapping intervals in a dataframe based on a set of hierarchical rules:

import numpy as np
import pandas as pd

def disambiguate(arg):

    arg['length'] = (arg.end - arg.begin).abs()

    df = arg[['begin', 'end', 'note_id', 'score', 'length']].copy()

    data = []
    out = pd.DataFrame()
    for row in df.itertuples():

        test = df[df['note_id']==row.note_id].copy()

        # get overlapping intervals: 
        # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
        iix = pd.IntervalIndex.from_arrays(test.begin.apply(pd.to_numeric), test.end.apply(pd.to_numeric), closed='neither')
        span_range = pd.Interval(row.begin, row.end)
        fx = test[iix.overlaps(span_range)].copy()

        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))

        # filter out overlapping rows via hierarchy 
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]

        elif maxLength > minLength:
            fx = fx[fx['length'] == maxLength]

        data.append(fx)

    out = pd.concat(data, axis=0)

    # randomly reindex to keep random row when dropping remaining duplicates: https://gist.github.com/cadrev/6b91985a1660f26c2742
    out.reset_index(inplace=True)
    out = out.reindex(np.random.permutation(out.index))

    return out.drop_duplicates(subset=['begin', 'end', 'note_id'])

This works fine, except that the dataframes I am iterating over have well over 100K rows each, so it takes forever to complete. I timed the various methods with %prun in Jupyter, and the call that seems to eat up the processing time is series.py:3719(apply). NB: I tried using modin.pandas, but that caused more problems (I kept getting an error about an Interval needing a left value less than its right, which I couldn't figure out; I may file a GitHub issue there).
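
For reference, the profiling was run along these lines (`df` here stands in for one of the real dataframes; -s cumulative sorts by cumulative time and -l 15 limits the output):

%prun -l 15 -s cumulative disambiguate(df)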

I am looking for a way to optimize this, such as using vectorization, but honestly I don't have the slightest clue how to convert this to a vectorized form.
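
The only vectorized piece I can picture is the overlap test itself: a broadcast pairwise mask per note, something like the untested sketch below (sub stands for the rows of a single note_id, and the open-interval condition mirrors closed='neither'), but I don't know how to carry the hierarchy filtering on from there:

import numpy as np

# untested sketch: pairwise overlap mask for the open intervals of one note
# sub is assumed to be df[df['note_id'] == some_note] with numeric begin/end
b = sub['begin'].to_numpy(dtype=float)
e = sub['end'].to_numpy(dtype=float)

# open intervals i and j overlap iff b_i < e_j and b_j < e_i
overlaps = (b[:, None] < e[None, :]) & (b[None, :] < e[:, None])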

Here is a sample of my data:

begin,end,note_id,score
0,9,0365,1
10,14,0365,1
25,37,0365,0.7
28,37,0365,1
38,42,0365,1
53,69,0365,0.7857142857142857
56,60,0365,1
56,69,0365,1
64,69,0365,1
83,86,0365,1
91,98,0365,0.8333333333333334
101,108,0365,1
101,127,0365,1
112,119,0365,1
112,127,0365,0.8571428571428571
120,127,0365,1
163,167,0365,1
196,203,0365,1
208,216,0365,1
208,223,0365,1
208,231,0365,1
208,240,0365,0.6896551724137931
217,223,0365,1
217,231,0365,1
224,231,0365,1
246,274,0365,0.7692307692307693
252,274,0365,1
263,274,0365,0.8888888888888888
296,316,0365,0.7222222222222222
301,307,0365,1
301,316,0365,1
301,330,0365,0.7307692307692307
301,336,0365,0.78125
308,316,0365,1
308,323,0365,1
308,330,0365,1
308,336,0365,1
317,323,0365,1
317,336,0365,1
324,330,0365,1
324,336,0365,1
361,418,0365,0.7368421052631579
370,404,0365,0.7111111111111111
370,418,0365,0.875
383,418,0365,0.8285714285714286
396,404,0365,1
396,418,0365,0.8095238095238095
405,418,0365,0.8333333333333334
432,453,0365,0.7647058823529411
438,453,0365,1
438,458,0365,0.7222222222222222
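
Note that the note_id values keep a leading zero, so if you load this sample to experiment with, read that column as a string (a minimal example; the sample is truncated here for brevity):

import io
import pandas as pd

sample_csv = """begin,end,note_id,score
0,9,0365,1
10,14,0365,1
25,37,0365,0.7
"""  # paste the full sample above in place of these three rows

df = pd.read_csv(io.StringIO(sample_csv), dtype={'note_id': str})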

Solution

  • I think I know what the issue was: I did my filtering on note_id incorrectly, and was thus iterating over the entire dataframe on every pass.

    It should have been:

        cases = set(df['note_id'].tolist())

        data = []
        for case in cases:
            test = df[df['note_id']==case].copy()
            for row in test.itertuples():

                # get overlapping intervals: 
                # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
                iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
                span_range = pd.Interval(row.begin, row.end)
                fx = test[iix.overlaps(span_range)].copy()

                maxLength = fx['length'].max()
                minLength = fx['length'].min()
                maxScore = abs(float(fx['score'].max()))
                minScore = abs(float(fx['score'].min()))

                if maxScore > minScore:
                    fx = fx[fx['score'] == maxScore]

                elif maxLength > minLength:
                    fx = fx[fx['length'] == maxLength]

                data.append(fx)

        out = pd.concat(data, axis=0)


    Testing on one note, it was taking over 16 minutes while I was still iterating over the entire, unfiltered dataframe. Now it finishes in 28 seconds!
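
    Putting it together, a self-contained version of the fixed routine might look something like the sketch below. It follows the same per-row logic, but builds the IntervalIndex once per note (the groupby is just one way of doing the note_id split) and assumes begin/end are already numeric:

        import numpy as np
        import pandas as pd

        def disambiguate(arg):
            arg = arg.copy()
            arg['length'] = (arg.end - arg.begin).abs()
            df = arg[['begin', 'end', 'note_id', 'score', 'length']].copy()

            data = []
            # one IntervalIndex per note, instead of rebuilding it for every row
            for note_id, test in df.groupby('note_id'):
                iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
                for row in test.itertuples():
                    span_range = pd.Interval(row.begin, row.end)
                    fx = test[iix.overlaps(span_range)].copy()

                    maxLength = fx['length'].max()
                    minLength = fx['length'].min()
                    maxScore = abs(float(fx['score'].max()))
                    minScore = abs(float(fx['score'].min()))

                    # hierarchy: keep the highest-scoring rows, otherwise the longest
                    if maxScore > minScore:
                        fx = fx[fx['score'] == maxScore]
                    elif maxLength > minLength:
                        fx = fx[fx['length'] == maxLength]

                    data.append(fx)

            out = pd.concat(data, axis=0)

            # randomly reindex to keep a random row when dropping remaining duplicates
            out.reset_index(inplace=True)
            out = out.reindex(np.random.permutation(out.index))
            return out.drop_duplicates(subset=['begin', 'end', 'note_id'])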