Search code examples
pythonpandasdataframegraphlabsframe

Appending data column from one SFrame to another SFrame


My training data train SFrame looks like this with 4 columns (the "Store" column is non-unique in this SFrame):

+-------+------------+---------+-----------+
| Store |    Date    |  Sales  | Customers |
+-------+------------+---------+-----------+
|   1   | 2015-07-31 |  5263.0 |   555.0   |
|   2   | 2015-07-31 |  6064.0 |   625.0   |
|   3   | 2015-07-31 |  8314.0 |   821.0   |
|   4   | 2015-07-31 | 13995.0 |   1498.0  |
|   3   | 2015-07-20 |  4822.0 |   559.0   |
|   2   | 2015-07-10 |  5651.0 |   589.0   |
|   4   | 2015-07-11 | 15344.0 |   1414.0  |
|   5   | 2015-07-23 |  8492.0 |   833.0   |
|   2   | 2015-07-19 |  8565.0 |   687.0   |
|   10  | 2015-07-09 |  7185.0 |   681.0   |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]

Given a second store SFrame (the "Store" column is unique in this SFrame):

+-------+-----------+
| Store | StoreType |
+-------+-----------+
|   1   |     c     |
|   2   |     a     |
|   3   |     a     |
|   4   |     c     |
|   5   |     a     |
|   6   |     a     |
|   7   |     a     |
|   8   |     a     |
|   9   |     a     |
|   10  |     a     |
+-------+-----------+

I could attach the appropriate StoreType to my train SFrame by going through each row in the train and find the appropriate StoreType from store and then keep the column and ise SFrame.add_column() afterwards:

store_type_col = []
for row in train:
    row_store = row['Store']
    row_storetype = next(i for i in store if i['Store'] == row_store)['StoreType']
    store_type_col.append(row_storetype)

train.add_column(graphlab.SArray(store_type_col, dtype=str), name='StoreType')

To get:

+-------+------------+---------+-----------+-----------+
| Store |    Date    |  Sales  | Customers | StoreType |
+-------+------------+---------+-----------+-----------+
|   1   | 2015-07-31 |  5263.0 |   555.0   |   c 
|   2   | 2015-07-31 |  6064.0 |   625.0   |   a 
|   3   | 2015-07-31 |  8314.0 |   821.0   |   a
|   4   | 2015-07-31 | 13995.0 |   1498.0  |   c
|   3   | 2015-07-20 |  4822.0 |   559.0   |   a
|   2   | 2015-07-10 |  5651.0 |   589.0   |   a
|   4   | 2015-07-11 | 15344.0 |   1414.0  |   c
|   5   | 2015-07-23 |  8492.0 |   833.0   |   a
|   2   | 2015-07-19 |  8565.0 |   687.0   |   a
|   10  | 2015-07-09 |  7185.0 |   681.0   |   a
+-------+------------+---------+-----------+-----------+
[986159 rows x 5 columns]

But I'm sure there's a simpler and faster way to do this using Graphlab. The current method has a worst case of O(n*m) where n = no. of rows in train and m = no. of rows in m.

Imagine that my store SFrame has 8 columns which I want to append to train. The code above would be super inefficient.

How else can I appending data column from one SFrame to another SFrame? (Pandas solutions are welcomed too)


Solution

  • You could do this using a join operation.

    out = train.join(store, on = 'Store')