The data comes in two data sets. I need to check whether a single time event in the first data set at a specific location falls within a time range in the second data set at the same location, and if so append the ID of the matching range to the first data set. I have a list of specific locations that I want to check.
My problem is that the first data set contains about 500,000 rows, and the second contains about 90,000 rows. Running through both data sets takes forever and my computing power is limited.
Here is the Python code:
import datetime
import pandas as pd

def assign_tRangeID(singleEventDF, timeRangeDF):
    # Events up to 15 minutes before a range's start time still count as a match
    margin = datetime.timedelta(minutes=15)
    for i, single in singleEventDF.iterrows():
        for j, timeRange in timeRangeDF.iterrows():
            if timeRange['start_time'] - margin <= single['singleEvent_time'] <= timeRange['end_time']:
                singleEventDF.at[i, 'tRange_ID'] = timeRange['ID']

for i, location in location_list.iterrows():
    single_subset = singleEvent.loc[singleEvent['loc'] == location['loc']]
    tRange_subset = timeRange.loc[timeRange['loc'] == location['loc']]
    assign_tRangeID(single_subset, tRange_subset)
I am a beginner in Python, so I'm just wondering if I can do this in a more efficient manner without having to use a database or some big data solution. Thanks for all the help!
This is a bit of a fun algorithmic problem once you strip away the DataFrame machinery. To answer your question: YES, this can be made faster. I'm going to restate your problem slightly so that the solution applies to more people; it shouldn't take much work to refactor it to fit the data structures you're using.
Before getting started, I would like to point out that @NileshIngle's code could provide a substantial speed boost over yours (I haven't benchmarked anything yet), but the time complexity is still quadratic in EVERY case, not just the worst case. That fact is hidden in the various pandas function calls he uses, but the code ALWAYS touches every time range for every single time. Given the sizes of the data sets you mentioned, that is not likely to be the solution you are looking for except in very special cases.
Disclaimer: I think this problem is solvable with a worst-case complexity of O(n log(n) + m log(m)), where n and m are the sizes of the respective inputs. My solution achieves that complexity on average but not in the worst case. Anyone want to come up with something better?
Given a list of single times and a list of time ranges, e.g.
single_times = [4, 5, 2, 3, -1]
time_ranges = [(1, 5), (10, 11), (2, 3)]
Can we design an algorithm faster than O(len(single_times) * len(time_ranges)) which outputs, for each element of single_times, the index of every matching time range in time_ranges? For this problem (taking into account that your example includes the endpoints), that output would be:
res = [[0], [0], [0, 2], [0, 2], []]
At first glance, the problem seems to be that for each element of single_times we have to check every element of time_ranges, leading to an absurd runtime for large amounts of data. For general kinds of data where we want to merge two lists, this quadratic runtime cannot be avoided. However, the fact that we can easily sort both of these lists gives us better computational bounds.
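For reference, here's a minimal sketch of that quadratic baseline (the name matching_times_naive is mine; it just restates the problem definition as code, and is handy as a correctness check for anything faster):

def matching_times_naive(single_times, time_ranges):
    # Inclusive endpoints, matching the example above
    return [[j for j, (start, end) in enumerate(time_ranges) if start <= t <= end]
            for t in single_times]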
Exploring this idea, what happens if single_times is sorted in ascending order? For example, if we know that the time ranges corresponding to a time of 3 are [(1, 5), (2, 3)] and we want to know the time ranges corresponding to 4, then we lose the range (2, 3), since its end time 3 is less than 4, and we don't gain any new time ranges.
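To make that concrete, here's a tiny illustrative snippet (not part of the final algorithm) showing the set of active ranges shrinking as we step from time 3 to time 4:

active_at_3 = [(1, 5), (2, 3)]                        # ranges matching time 3
active_at_4 = [r for r in active_at_3 if r[1] >= 4]   # drop ranges that have already ended
print(active_at_4)                                    # [(1, 5)]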
We're going to apply that idea to create a rudimentary sort-based algorithm that matches time ranges to times. In your application you don't really need the returned values to be in the same order as long as you keep the object references, but we're going to keep track of the original positions of everything anyway. Given the choice I'd use numpy for efficiency and a variety of convenience functions, but raw Python is more portable.
import itertools as it

def matching_times(single_times, time_ranges):
    # Sort the single times, remembering their original positions
    single_index = sorted(range(len(single_times)), key=lambda i: single_times[i])
    single_times_sorted = [single_times[i] for i in single_index]
    # Tag each time range with its original index and sort by start time
    time_ranges_sorted = sorted([(i, v[0], v[1]) for i, v in enumerate(time_ranges)],
                                key=lambda w: w[1])

    m = 0  # keep track of min location in time_ranges_sorted
    res = [[]]

    # Find solutions for single_times_sorted[0]
    for i, w in enumerate(time_ranges_sorted):
        if w[1] > single_times_sorted[0]:
            break
        if w[2] >= single_times_sorted[0]:
            res[0].append(w)
        m = i + 1

    for cur_time in it.islice(single_times_sorted, 1, len(single_times_sorted)):
        # Keep previous solutions that don't end too soon
        res.append([w for w in res[-1] if w[2] >= cur_time])
        # Strip extraneous information as soon as possible to preserve a semblance
        # of memory efficiency
        res[-2] = [w[0] for w in res[-2]]
        # Scan forward for new ranges that start at or before cur_time
        for i, w in enumerate(it.islice(time_ranges_sorted, m, len(time_ranges_sorted)), m):
            if w[1] > cur_time:
                break
            if w[2] >= cur_time:
                res[-1].append(w)
            m = i + 1

    # Strip remaining extra information from solution
    res[-1] = [w[0] for w in res[-1]]

    # Re-sort result according to original locations in single_times
    return [v[1] for v in sorted(enumerate(res), key=lambda v: single_index[v[0]])]
The desired solution is then obtained quite simply:
>>> res = matching_times(single_times, time_ranges)
>>> res
[[0], [0], [0, 2], [0, 2], []]
This still has a worst-case quadratic time complexity, but for real-world data, which likely doesn't have many matching time ranges per single time relative to the total number of time ranges, the expected runtime is closer to O(n log(n) + m log(m)), with n and m the lengths of the two respective input lists.
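If it helps, here's a rough sketch of how this could be wired back into your DataFrames. The column names ('loc', 'singleEvent_time', 'start_time', 'end_time', 'ID', 'tRange_ID') are taken from your question, the helper name assign_tRangeID_fast is mine, and I apply your 15-minute margin once up front by widening each range's start; adjust to your actual schema as needed:

import datetime

def assign_tRangeID_fast(singleEventDF, timeRangeDF):
    # Return a dict mapping singleEventDF index labels to the ID of a matching range
    margin = datetime.timedelta(minutes=15)
    single_times = list(singleEventDF['singleEvent_time'])
    # Widen each range's start by the margin so matching_times can compare directly
    time_ranges = [(start - margin, end)
                   for start, end in zip(timeRangeDF['start_time'], timeRangeDF['end_time'])]
    range_ids = list(timeRangeDF['ID'])
    matches = matching_times(single_times, time_ranges)
    # Keep a single matching range ID per event, like the original code stores one ID
    return {label: range_ids[m[0]]
            for label, m in zip(singleEventDF.index, matches) if m}

for _, location in location_list.iterrows():
    single_subset = singleEvent.loc[singleEvent['loc'] == location['loc']]
    tRange_subset = timeRange.loc[timeRange['loc'] == location['loc']]
    for label, range_id in assign_tRangeID_fast(single_subset, tRange_subset).items():
        singleEvent.at[label, 'tRange_ID'] = range_id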