My training data train
SFrame
looks like this with 4 columns (the "Store" column is non-unique in this SFrame):
+-------+------------+---------+-----------+
| Store | Date | Sales | Customers |
+-------+------------+---------+-----------+
| 1 | 2015-07-31 | 5263.0 | 555.0 |
| 2 | 2015-07-31 | 6064.0 | 625.0 |
| 3 | 2015-07-31 | 8314.0 | 821.0 |
| 4 | 2015-07-31 | 13995.0 | 1498.0 |
| 3 | 2015-07-20 | 4822.0 | 559.0 |
| 2 | 2015-07-10 | 5651.0 | 589.0 |
| 4 | 2015-07-11 | 15344.0 | 1414.0 |
| 5 | 2015-07-23 | 8492.0 | 833.0 |
| 2 | 2015-07-19 | 8565.0 | 687.0 |
| 10 | 2015-07-09 | 7185.0 | 681.0 |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]
Given a second store
SFrame
(the "Store" column is unique in this SFrame):
+-------+-----------+
| Store | StoreType |
+-------+-----------+
| 1 | c |
| 2 | a |
| 3 | a |
| 4 | c |
| 5 | a |
| 6 | a |
| 7 | a |
| 8 | a |
| 9 | a |
| 10 | a |
+-------+-----------+
I could attach the appropriate StoreType
to my train
SFrame
by going through each row in the train
and find the appropriate StoreType
from store
and then keep the column and ise SFrame.add_column()
afterwards:
store_type_col = []
for row in train:
row_store = row['Store']
row_storetype = next(i for i in store if i['Store'] == row_store)['StoreType']
store_type_col.append(row_storetype)
train.add_column(graphlab.SArray(store_type_col, dtype=str), name='StoreType')
To get:
+-------+------------+---------+-----------+-----------+
| Store | Date | Sales | Customers | StoreType |
+-------+------------+---------+-----------+-----------+
| 1 | 2015-07-31 | 5263.0 | 555.0 | c
| 2 | 2015-07-31 | 6064.0 | 625.0 | a
| 3 | 2015-07-31 | 8314.0 | 821.0 | a
| 4 | 2015-07-31 | 13995.0 | 1498.0 | c
| 3 | 2015-07-20 | 4822.0 | 559.0 | a
| 2 | 2015-07-10 | 5651.0 | 589.0 | a
| 4 | 2015-07-11 | 15344.0 | 1414.0 | c
| 5 | 2015-07-23 | 8492.0 | 833.0 | a
| 2 | 2015-07-19 | 8565.0 | 687.0 | a
| 10 | 2015-07-09 | 7185.0 | 681.0 | a
+-------+------------+---------+-----------+-----------+
[986159 rows x 5 columns]
But I'm sure there's a simpler and faster way to do this using Graphlab
. The current method has a worst case of O(n*m)
where n = no. of rows in train
and m = no. of rows in m
.
Imagine that my store
SFrame
has 8 columns which I want to append to train
. The code above would be super inefficient.
How else can I appending data column from one SFrame to another SFrame? (Pandas solutions are welcomed too)
You could do this using a join
operation.
out = train.join(store, on = 'Store')