
GraphLab SFrames - How to retain all columns in groupby


I have an SFrame and want to run a groupby that applies some operator to a column. However, this returns an SFrame containing only the key columns I specified. How can I run the operation on some columns but still keep all the other columns?


Solution

  • As far as I understand your question, you want to run operations on a column without losing the columns' original state. The example below may illustrate this. Suppose we have a movie dataset as an SFrame sf:

    movieId    userId    actors    rating
    102        10        A,B,C      5
    204        8         B,C,D      4
    333        3         K,L,M      3
    204        11        P,Q,R      1
    423        3         K,B,C      4    
    533        31        K,A,C      2    
    633        3         P,L,A      3
    .
    .
    ...
    

    In the above SFrame, user 3 gave multiple ratings, so you may want to compute each user's mean rating:

     import graphlab.aggregate as agg
     rating_stats = sf.groupby(key_columns='userId', operations={'mean_rating': agg.MEAN('rating')})
    

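    Then you may want to add the computed column to the SFrame while retaining the columns already present. Because rating_stats has one row per user while sf has one row per rating, the column cannot be assigned directly; join it back on userId instead:

    sf = sf.join(rating_stats, on='userId', how='left')
    

    The existing columns of sf are unchanged; it simply gains a mean_rating column.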

    So, to answer your question: when you use groupby(), it's better to keep the aggregated result in a separate SFrame dedicated to that operation. You can then add its columns to the original SFrame, or use join to bring the remaining columns of the original onto the aggregated SFrame. Repeatedly modifying the original SFrame just to run an operation is not good practice.
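The group-then-join pattern described above can be sketched in plain Python to show what the result looks like. This is only an illustration of the semantics, not GraphLab code: the helper names `mean_by_key` and `join_back` are hypothetical, and the rows are the sample values from the table above.

```python
from collections import defaultdict

# Sample rows copied from the example table above.
rows = [
    {"movieId": 102, "userId": 10, "actors": "A,B,C", "rating": 5},
    {"movieId": 204, "userId": 8,  "actors": "B,C,D", "rating": 4},
    {"movieId": 333, "userId": 3,  "actors": "K,L,M", "rating": 3},
    {"movieId": 204, "userId": 11, "actors": "P,Q,R", "rating": 1},
    {"movieId": 423, "userId": 3,  "actors": "K,B,C", "rating": 4},
    {"movieId": 533, "userId": 31, "actors": "K,A,C", "rating": 2},
    {"movieId": 633, "userId": 3,  "actors": "P,L,A", "rating": 3},
]


def mean_by_key(rows, key, value):
    """Group rows by `key` and return {key_value: mean of `value`}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}


def join_back(rows, stats, key, new_column):
    """Left-join the aggregate onto copies of the original rows."""
    return [dict(row, **{new_column: stats[row[key]]}) for row in rows]


# The aggregate lives in its own structure (one entry per user) ...
rating_stats = mean_by_key(rows, "userId", "rating")
# ... and is joined back, so every original column is retained.
with_mean = join_back(rows, rating_stats, "userId", "mean_rating")
```

Each row of `with_mean` keeps movieId, userId, actors and rating and gains mean_rating, while `rows` itself is left untouched. All three rows for user 3 carry the same mean.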

    Also note that for a column holding multiple entities per row, such as actors in this SFrame, it is easier to apply the stack method before groupby(), so that each entity gets its own row. I hope that helps.
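To make the stack-then-groupby idea concrete, here is a minimal plain-Python sketch of the semantics, again not GraphLab code. It assumes actors is stored as a comma-separated string, which would first have to be split into a list, since as far as I know SFrame's stack works on list, dict, or array columns. The helper names `stack_actors` and `mean_rating_by` are hypothetical.

```python
from collections import defaultdict

# Sample rows copied from the example table above.
rows = [
    {"movieId": 102, "userId": 10, "actors": "A,B,C", "rating": 5},
    {"movieId": 204, "userId": 8,  "actors": "B,C,D", "rating": 4},
    {"movieId": 333, "userId": 3,  "actors": "K,L,M", "rating": 3},
    {"movieId": 204, "userId": 11, "actors": "P,Q,R", "rating": 1},
    {"movieId": 423, "userId": 3,  "actors": "K,B,C", "rating": 4},
    {"movieId": 533, "userId": 31, "actors": "K,A,C", "rating": 2},
    {"movieId": 633, "userId": 3,  "actors": "P,L,A", "rating": 3},
]


def stack_actors(rows):
    """Emit one row per (original row, actor) pair, like a stack step."""
    stacked = []
    for row in rows:
        for actor in row["actors"].split(","):
            # Copy every column except the multi-valued one.
            new_row = {k: v for k, v in row.items() if k != "actors"}
            new_row["actor"] = actor
            stacked.append(new_row)
    return stacked


def mean_rating_by(rows, key):
    """Group rows by `key` and return {key_value: mean rating}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["rating"])
    return {k: sum(v) / len(v) for k, v in groups.items()}


stacked = stack_actors(rows)                  # 7 rows x 3 actors each
actor_means = mean_rating_by(stacked, "actor")
```

After stacking, grouping by the single-valued actor column is an ordinary groupby, which is why stacking first makes per-actor aggregates straightforward.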