Search code examples
pythontimeintervalssimilaritypdist

Python - Compute similarities between items based on temporal information (instant, interval)


I want to compute similarity between items (0,1,2,3..) based on their temporal information. Temporal information may be time instant (startdate), time interval (startdate,enddate) or null (NaT); see an example of dataframe (df_for) below.

enter image description here

  1. {instant,instant} : if equal sim = 1, else sim =0
  2. {instant,null} or vice versa, sim =0
  3. {instant, interval}: if instant within interval, sim =1 or if an interval contains an instant, sim = 1
  4. {interval,interval} : if intervals overlaps, sim = intersection of both intervals / union of both intervals
  5. {interval,interval} : if an interval contains another, then sim = 1

The following python code obtains temporal information from the dataframe and performs the conditions above (1-5). The code is verbose, i wonder if there is a smart way/lib to calculate similarity between time periods and time instants using python.

m, k = df_for.shape
sim = np.zeros((m, m))
data = df_for.values
for i in range(m):
    for j in range(m):
        if i != j:
            st1 = data[i][0]
            ed1 = data[i][1]
            st2 = data[j][0]
            ed2 = data[j][1]
            #both items are null values
            if pd.isnull(st1) and pd.isnull(ed1) and pd.isnull(st2) and pd.isnull(ed2):
                sim[i][j] = 0.
            # {instant, instant} => equal, not equal
            if pd.notnull(st1) and pd.isnull(ed1) and pd.notnull(st2) and pd.isnull(ed2):
                if st1 == st2:
                    sim[i][j] = 1.
                else:
                    sim[i][j] = 0.
            # {instant, null} => sim is 0
            if pd.notnull(st1) and pd.isnull(ed1) and pd.isnull(st2) and pd.isnull(ed2):
                sim[i][j] = 0.
            # {instant, interval} => meets, during
            if pd.notnull(st1) and pd.isnull(ed1) and pd.notnull(st2) and pd.notnull(ed2):
                    if(st2 <= st1 <= ed2):
                        sim[i][j] = 1. #a time is between two other times
                    else:
                        sim[i][j] = 0.
            # {interval, instant} => meets, contains
            if pd.notnull(st1) and pd.notnull(ed1) and pd.notnull(st2) and pd.isnull(ed2):
                    if(st1 <= st2 <= ed1):
                        sim[i][j] = 1. #a time is between two other times
                    else:
                        sim[i][j] = 0.
            # {interval, interval} => equal, overlaps, not overlaps
            if pd.notnull(st1) and pd.notnull(ed1) and pd.notnull(st2) and pd.notnull(ed2): 
                if (st1 <= st2 <= ed1) or (st2 <= st1 <= ed2):
                    intersect = min(ed1,ed2)- max(st1,st2) # earliestend-lateststart
                    union = max(st1,st2,ed1,ed2) - min(ed1,ed2,st1,st2)
                    overlaps = intersect/union
                    #print(intersect/np.timedelta64(1, 'D'),union/np.timedelta64(1, 'D'))
                    if (st1 > st2 and ed1 < ed2) or (st1 < st2 and ed1 > ed2): # contains, during
                        overlaps = 1.0
                    sim[i][j]=overlaps  
                else:
                    sim[i][j] = 0.  
        else:
            sim[i][j] = 1.

Solution

  • I can see several ways to simplify the code.

    1. First, I'd suggest moving the comparison code to a separate function that takes st1, ed1, st2, and ed2 as arguments. (As a side note, why st (start time) but ed (end date)? Consistent names might be nice.) You'd be able to return your result to the calling code, which would be responsible for doing the assignment into the array of results.

    2. Speaking of that calling code... It doesn't need to call the comparison function for every pair of time-ranges. The results should always be symmetrical (e.g. compare(data[i][0], data[i][1], data[j][0], data[j][1]) is going to return the same thing as compare(data[j][0], data[j][1], data[i][0], data[i][1])). So just call one of those, and assign the results to both sim[i][j] and sim[j][i].

    3. Rather than lots of if some_test_expression: return 1; else: return 0 blocks, just return the results of the comparison: return some_test_expression. Python's bool type is a subclass of int, so these should convert to numbers automatically when you assign them into a numpy array (you could also cast to float explicitly if you want the function to always return the same type).

    4. Handle all cases with nulls at the top. They're an easy case, and if you deal with them first (and return, so the rest of the function doesn't run), you don't need to check as many things for null later. You don't need to handle null/null, null/instantand null/interval separately, just return zero if either start value is null: if isnull(st1) or isnull(st2): return 0.

    5. Instant/interval comparisons are symmetrical, so just write the logic one way and flip the parameters around if they're initially in the wrong order: if isnull(ed2): st1, ed1, st2, ed2 = st2, ed2, st1, ed1

    6. I don't see too much that can be improved with the specific implementations of your comparisons between instants and intervals or between two intervals (other than the general things mentioned already above). However, I would like to say that your formula for overlapping intervals will have a big discontinuity in similarity as a small interval slowly overlaps more with a larger one, then gets fully contained. For instance, a one second interval that starts just a millisecond before a one hour interval will result in a similarity value between the two of 0.000277 (0.999 / (3600 + 0.001)), but one that starts at the same time as the larger interval does will have similarity of 1.0 (a big difference). A more continuously changing measure of similarity would come from dividing the overlap amount by the size of the smaller interval (rather than the size of the union). This formula doesn't need a special case for one interval fully containing another, since the default calculation will give a similarity of 1.0 already (the overlap will be the size of the smaller interval, so you'll be dividing it by itself).

    So, putting all that together, here's how I'd rewrite things:

    def compare_intervals(st1, et1, st2, et2): # note I've renamed the ed's to et
        # all nulls
        if pd.isnull(st1) or pd.isnull(st2):
            return 0.
    
        # {instant, instant} => equal, not equal
        if pd.isnull(et1) and pd.isnull(et2):
            return float(st1 == st2)
    
        # {instant, interval} => flip params (to be handled in next block)
        if pd.isnull(et1):
            st1, et1, st2, et2 = st2, et2, st1, et1
    
        # {interval, instant} => meets, contains
        if pd.isnull(et2):
            return float(st1 <= st2 <= et1)
    
        # {interval, interval} => equal, overlaps, not overlaps
        if (st1 <= st2 <= et1) or (st2 <= st1 <= et2):
            intersect = min(et1, et2) - max(st1, st2)
            min_interval = min(et1 - st1, et2 - st2) # using minimum instead of union
            return intersect / min_interval
    
        return 0. # intervals didn't overlap
    
    m, k = df_for.shape
    sim = np.zeros((m, m))
    data = df_for.values
    for i in range(m):
        for j in range(i, m): # we only iterate on j values >= i
            if i == j:
                sim[i,j] = 1.
            else:
                sim[i,j] = sim[j,i] = compare_intervals(data[i][0], data[i][1],
                                                        data[j][0], data[j][1])
    

    A few notes that I couldn't fit into comments in the code:

    1. No test for the values being intervals is needed at the last step of the compare_interval function, since all other cases (involving instants and nulls) have already been handled.

    2. I've changed the assignments to the sim array to use tuple indexes, as that's more natural than nested indexes for multidimensional numpy arrays (slices don't work if you use nested indexes). You might be able to do the same thing with the data lookups, but I've not changed them since I don't know pandas nearly as well as I know numpy.

    3. Speaking of my lack of pandas knowledge, there is probably a way to apply the comparison function directly to a "join" of the dataframe with itself. I don't know how to do that though. If it exists, it will probably be faster than the nested loops I left mostly unchanged in the calling code (even if it doesn't optimize out the symmetric cases).