python-2.7 datetime timedelta graphlab bigdata

Datetime in python - speed of calculations - big data


I want to find the difference (in days) between two columns in a dataframe (more specifically, a GraphLab SFrame).

I have written two different functions to do this, but neither is fast enough. Speed is my main issue right now, as I have ~80 million rows to process.

Both attempts are shown below.

The t2_colname_str and t1_colname_str arguments are the names of the columns I want to use; both columns contain datetime.datetime objects.

For Loop

def diff_days(sframe_obj,t2_colname_str,t1_colname_str):
    import graphlab as gl
    import datetime as datetime

    # creating the new column name to be used later
    new_colname = str(t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9])
    diff_days_list = []

    for i in range(len(sframe_obj[t2_colname_str])):
        t2 = sframe_obj[t2_colname_str][i]
        t1 = sframe_obj[t1_colname_str][i]
        try:
            diff = t2 - t1
            diff_days = diff.days
            diff_days_list.append(diff_days)
        except TypeError:
            diff_days_list.append(None)

    sframe_obj[new_colname] = gl.SArray(diff_days_list)

List Comprehension

I know this is not the intended purpose of list comprehensions, but I just tried it to see if it was faster.

def diff_days(sframe_obj,t2_colname_str,t1_colname_str):
    import graphlab as gl
    import datetime as datetime

    # creating the new column name to be used later
    new_colname = str(t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9])

    # explicit None checks on both columns; rows with a missing value yield None
    diff_days_list = [
        (sframe_obj[t2_colname_str][i] - sframe_obj[t1_colname_str][i]).days
        if sframe_obj[t2_colname_str][i] is not None
        and sframe_obj[t1_colname_str][i] is not None
        else None
        for i in range(len(sframe_obj[t2_colname_str]))
    ]

    sframe_obj[new_colname] = gl.SArray(diff_days_list)

Additional Notes

I have been using GraphLab-Create by Dato and their SFrame data-structure mainly because it parallelizes all the computation which makes my analysis super-fast and it has a great library for machine learning applications. It's a great product if you haven't checked it out already.

GraphLab User Guide can be found here: https://dato.com/learn/userguide/index.html


Solution

  • I'm glad you found an approach that works for you. However, SArrays support vector operations, so you don't need to loop through every element of the column. SArrays can be iterated, but they're REALLY slow at that.

    Unfortunately, SArrays don't support vector operations on datetime types because they don't support a "timedelta" type. You can do this though:

    diff = sframe_obj[t2_colname_str].astype(int) - sframe_obj[t1_colname_str].astype(int)
    

    That will convert both columns to UNIX timestamps and then do a vectorized difference operation, which should be plenty fast, or at least faster than a conversion to NumPy. The result is a difference in seconds, so it still needs to be divided by 86400 to get days.
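
UNIX timestamps count seconds, so a timestamp difference has to be floor-divided by 86400 to match what timedelta.days returns. Here is a rough sketch of that arithmetic on a single pair of values, in plain Python rather than GraphLab. The helper names (to_unix_seconds, diff_days_from_timestamps) are made up for illustration, and it assumes naive datetimes that can be treated as UTC:

```python
import calendar
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def to_unix_seconds(dt):
    # Hypothetical helper: treat the naive datetime as UTC and return
    # whole seconds since the epoch (microseconds are dropped).
    return calendar.timegm(dt.timetuple())


def diff_days_from_timestamps(t2, t1):
    # Floor division rounds toward negative infinity, which is the same
    # convention timedelta.days uses for negative differences.
    return (to_unix_seconds(t2) - to_unix_seconds(t1)) // SECONDS_PER_DAY


t1 = datetime(2015, 1, 1, 12, 0, 0)
t2 = datetime(2015, 1, 11, 6, 30, 0)

print(diff_days_from_timestamps(t2, t1))  # same value as (t2 - t1).days
print(diff_days_from_timestamps(t1, t2))  # same value as (t1 - t2).days
```

On an SFrame the same division would presumably be applied to the whole diff column at once instead of value by value; I haven't checked how SArray division rounds, so verify a few rows against timedelta.days before trusting it on all 80 million.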