Search code examples
pythonpandasnumpyfor-loopnested-loops

Method to avoid this nested for-loop


Is a nested for-loop necessary for this code, or is there a more efficient work-around?

This is a simplified version which searches for successive, overlapping intervals within a data set comprised of 20 random integers from 1 to 1000. It runs through error values of 1-100 to create the intervals by adding/subtracting them from the 20 random integers.


Example:

input assuming data frame of size 10 instead of 20:

df = [433, 3, 4, 5, 6, 7, 378, 87, 0, 500]

output for error = 1 in for-loop:

overlaps = {0:[[1, 2, 3, 4, 5]]}

def find_overlap(df, error):
    """
    df: dataframe with random 20 integers from 1-1000
    error: used to create the interval by +/- to each value in the dataframe
    returns: list of list of indexes overlapping
    """

    # add the interval to the dataframe as columns of minimum and maximum
    df["min"] = df["x"] - error
    df["max"] = df["x"] + error

    # overlaps stores lists of indexes that overlap
    overlaps = []

    # fill in data for start
    temporary = [0]
    minimum = df["min"].iloc[0]
    maximum = df["min"].iloc[0]

    # iterates through the dataframe checking for overlap between successive intervals
    for index , row in df.iterrows():
        current_min = row["min"]
        current_max = row["max"]

        # yes overlap
        if (current_min <= maximum) and (current_max >= minimum):
            temporary.append(index)
            if current_min > minimum:
                minimum = current_min
            if current_max < maximum:
                maximum = current_max
            continue

        # no overlap - also check for 5 successive overlaps
        if len(temporary) >= 5:
            overlaps.append(temporary)
        temporary = [index]
        minimum = current_min
        maximum = current_max

    return overlaps



# creates dataframe with 20 random integers from 1 to 1000
df = pd.DataFrame(np.random.randint(1, 1000, 20), columns=["x"])

overlaps = {}
for error in range(0,100):
    lst = find_overlap(df, error)
    if len(lst):
        overlaps[error] = lst

print(overlaps)

Solution

  • So, from what I've understood out of your code... You're looking to:

    1. Compute the difference between all values of x.
    2. Determine if it's less than error where error ranges from [0, 100).
    3. Select all subarrays of size 5.

    Assuming my interpretation is correct... You can actually vectorize this and avoid for-loops, like your intuition led you to believe. Ultimately, if my interpretation is incorrect, this should, at least, give you a decent start to creating a vectorized version of your desired code. 🙂

    Updated Solution (considers 5-tuples)

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame(np.random.randint(1, 1000, 20), columns=["x"])
    
    overlaps = {}
    
    for margin in range(0, 100):
        diffs = np.abs(df["x"].values - np.roll(df["x"], margin))
        # np.convolve is analogous to a sliding window sum
        quint = np.convolve(diffs == margin, np.ones(5), "valid")
        index = np.nonzero(quint == 5)[0]
        if index.size > 0:
            overlaps[margin] = [list(range(i, i + 5)) for i in index]
    

    Original Solution (doesn't consider 5-tuples)

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame(np.random.randint(1, 1000, 20), columns=["x"])
    
    overlaps = {}
    
    for margin in range(0, 100):
        diffs = np.abs(df["x"].values - np.roll(df["x"], margin))
        index = np.nonzero(diff == margin)[0]
        if idx.size > 0:
            overlaps[margin] = idx
    

    In case you're unfamiliar with numpy, .size gives you the total size of the ndarray. (So a 3D array with shapes (10, 20, 30) has a size of 6000.)