Search code examples
performancegraphlabsframe

Converting SFrames into input dataset Sframes


I have a pretty bad way to convert my input logs to the input dataset. I have an SFrame sf with the following format:

user_id     int
timestamp   datetime.datetime
action      int
reasoncode  str

action column takes up 9 values ranging from 1 to 9.

So, every user_id can perform more than 1 action, more than once.

I am trying to obtain all unique user_id from sf and create an op_sf in the following manner:

y = 225

def calc_class(a,x):
  diffd = a['timestamp'].apply(lambda x: (dte - x).days)
  g = 0
  b = 0
  for i in diffd:
    if i > y:
    g += 1
  else:
    b += 1
  if b>= x:
    return 4
  elif b!= 0:
    return 3
  elif g>= 0:
    return 2
  else:
    return 1

l1 = []
ids = z['user_id'].unique()

for idd in ids:
 temp = sf[sf['user_id']== idd]
 zero1 = temp[temp['action'] == 1]
 zero2 = temp[temp['action'] == 2]
 zero3 = temp[temp['action'] == 3]
 zero4 = temp[temp['action'] == 4]
 zero5 = temp[temp['action'] == 5]
 zero6 = temp[temp['action'] == 6]
 zero7 = temp[temp['action'] == 7]
 zeroh8 = temp[temp['reasoncode'] == 'xyz']
 zero9 = temp[temp['reasoncode'] == 'abc']
 /* I'm getting clas1 to clas9 from function calc_class for each action
    clas1 to clas9 are 4 integers ranging from 1 to 4
 */ 
 clas1 = calc_class(zero1,2)
 clas2 = calc_class(zero2,2)
 clas3 = calc_class(zero3,2)
 clas4 = calc_class(zero4,2)
 clas5 = calc_class(zero5,2)
 clas6 = calc_class(zero6,2)
 clas7 = calc_class(zero7,2)
 clas8 = calc_class(zero8,2)
 clas9 = calc_class(zero9,2)
 l1.append([idd,clas1,clas2,clas3,clas4,clas5*(-1),clas6*(-1),clas7*(-1),clas8*(-1),clas9])

I wanted to know if this is the fastest way of doing this. Specifically if it is possible to do the same thing without generating the zero1 to zero9 SFrames.

An example sf:

user_id timestamp action reasoncode 
574 23/09/15 12:43  1   None
574 23/09/15 11:15  2   None
574 06/10/15 11:20  2   None
574 06/10/15 11:21  3   None
588 04/11/15 10:00  1   None
588 05/11/15 10:00  1   None
555 15/12/15 13:00  1   None
585 22/12/15 17:30  1   None
585 15/01/16 07:44  7   xyz
588 06/01/16 08:10  7   abc

l1 corresponding to the above sf:

574 1   2   2   0   0   0   0   0   0
588 3   0   0   0   0   0   0   0   3
555 3   0   0   0   0   0   0   0   0
585 3   0   0   0   0   0   0   3   0

Solution

  • I think your logic is relatively complex, but it's still more efficient to use column-wise operations on the whole dataset, rather than extracting the subset of rows for each user. The key tools are SFrame.groupby, SFrame.apply, SFrame.unstack, and SFrame.unpack. API docs here:

    https://dato.com/products/create/docs/generated/graphlab.SFrame.html

    Here's a solution that uses slightly simpler data than your example and slightly simpler logic to code the old vs. new actions.

    # Set up and make the data
    import graphlab as gl
    import datetime as dt
    
    sf = gl.SFrame({'user': [574, 574, 574, 588, 588, 588],
                    'timestamp': [dt.datetime(2015, 9, 23), dt.datetime(2015, 9, 23),
                                  dt.datetime(2015, 10, 6), dt.datetime(2015, 11, 4),
                                  dt.datetime(2015, 11, 5), dt.datetime(2016, 1, 6)],
                    'action': [1, 2, 3, 1, 1, 7]})
    
    # Count old vs. new actions.
    sf['days_elapsed'] = (dt.datetime.today() - sf['timestamp']) / (3600 * 24)
    sf['old_threshold'] = sf['days_elapsed'] > 225
    
    aggregator = {'total_count': gl.aggregate.COUNT('user'),
                  'old_count': gl.aggregate.SUM('old_threshold')}
    grp = sf.groupby(['user', 'action'], aggregator)
    
    # Code the actions according to old vs. new. Use your own logic here.
    grp['action_code'] = grp.apply(
                           lambda x: 2 if x['total_count'] > x['old_count'] else 1)
    grp = grp[['user', 'action', 'action_code']]
    
    # Reshape the results into columns.
    sf_new = (grp.unstack(['action', 'action_code'], new_column_name='action_code')
                 .unpack('action_code'))
    
    # Fill in zeros for entries with no actions.
    for c in sf_new.column_names():
        sf_new[c] = sf_new[c].fillna(0)
    
    print sf_new
    
    +------+---------------+---------------+---------------+---------------+
    | user | action_code.1 | action_code.2 | action_code.3 | action_code.7 |
    +------+---------------+---------------+---------------+---------------+
    | 588  |       2       |       0       |       0       |       2       |
    | 574  |       1       |       1       |       1       |       0       |
    +------+---------------+---------------+---------------+---------------+
    [2 rows x 5 columns]