In graphlab
, I have the following SFrame
call train
:
import graphlab
train = graphlab.read_csv('clean_train.csv')
train.head()
[out]:
+-------+------------+---------+-----------+
| Store | Date | Sales | Customers |
+-------+------------+---------+-----------+
| 1 | 2015-07-31 | 5263.0 | 555.0 |
| 2 | 2015-07-31 | 6064.0 | 625.0 |
| 3 | 2015-07-31 | 8314.0 | 821.0 |
| 4 | 2015-07-31 | 13995.0 | 1498.0 |
| 3 | 2015-07-20 | 4822.0 | 559.0 |
| 2 | 2015-07-10 | 5651.0 | 589.0 |
| 4 | 2015-07-11 | 15344.0 | 1414.0 |
| 5 | 2015-07-23 | 8492.0 | 833.0 |
| 2 | 2015-07-19 | 8565.0 | 687.0 |
| 10 | 2015-07-09 | 7185.0 | 681.0 |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]
To get the median Sales per store, I can do the following to attach a new column for the median sales per store using graphlab
:
mediansales_perstore = train.groupby('Store', operations={'mediansales': agg.QUANTILE('Sales', 0.5)})
train_stores = train_stores.join(mediansales_perstore, on='Store')
test_stores['mediansales'] = [i[0] for i in test_stores['mediansales']]
The code works in graphlab
that addeds a new row mediansales
. But when I try to use pandas
DataFrame
with the code:
mediansales_perstore = train.groupby(['Store'])['Sales'].median()
This extracts the median sales per store as per the graphlab code, but when i try to merge it back to the train matrix with:
train.join(pd.DataFrame(train.groupby(['Store'])['Sales'].median()), on='Store')
It fails and throws the error:
ValueError Traceback (most recent call last)
<ipython-input-15-7b64cb46e386> in <module>()
----> 1 train.join(pd.DataFrame(train.groupby(['Store'])['Sales'].median()), on='Store')
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in join(self, other, on, how, lsuffix, rsuffix, sort)
4017 # For SparseDataFrame's benefit
4018 return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 4019 rsuffix=rsuffix, sort=sort)
4020
4021 def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
4031 return merge(self, other, left_on=on, how=how,
4032 left_index=on is None, right_index=True,
-> 4033 suffixes=(lsuffix, rsuffix), sort=sort)
4034 else:
4035 if on is not None:
/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_index=right_index, sort=sort, suffixes=suffixes,
37 copy=copy)
---> 38 return op.get_result()
39 if __debug__:
40 merge.__doc__ = _merge_doc % '\nleft : DataFrame'
/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in get_result(self)
190
191 llabels, rlabels = items_overlap_with_suffix(ldata.items, lsuf,
--> 192 rdata.items, rsuf)
193
194 lindexers = {1: left_indexer} if left_indexer is not None else {}
/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in items_overlap_with_suffix(left, lsuffix, right, rsuffix)
3969 if not lsuffix and not rsuffix:
3970 raise ValueError('columns overlap but no suffix specified: %s' %
-> 3971 to_rename)
3972
3973 def lrenamer(x):
ValueError: columns overlap but no suffix specified: Index([u'Sales'], dtype='object')
How could I merge the median of the "Sales" column using "Store" as the key using pandas
? The graphlab
code works though.
You may do this in one-stage using transform
:
>>> train['Median-Sales'] = train.groupby('Store')['Sales'].transform('median')
>>> train
Store Date Sales Customers Median-Sales
0 1 2015-07-31 5263 555 5263.0
1 2 2015-07-31 6064 625 6064.0
2 3 2015-07-31 8314 821 6568.0
3 4 2015-07-31 13995 1498 14669.5
4 3 2015-07-20 4822 559 6568.0
5 2 2015-07-10 5651 589 6064.0
6 4 2015-07-11 15344 1414 14669.5
7 5 2015-07-23 8492 833 8492.0
8 2 2015-07-19 8565 687 6064.0
9 10 2015-07-09 7185 681 7185.0
the merge error simply says that you have duplicated column names across left and right frame, so either you need to provide suffixes to distinguish columns or rename the columns:
>>> right = train.groupby('Store')['Sales'].median()
>>> right.name = 'Median-Sales'
>>> train.join(right, on='Store')
Store Date Sales Customers Median-Sales
0 1 2015-07-31 5263 555 5263.0
1 2 2015-07-31 6064 625 6064.0
2 3 2015-07-31 8314 821 6568.0
3 4 2015-07-31 13995 1498 14669.5
4 3 2015-07-20 4822 559 6568.0
5 2 2015-07-10 5651 589 6064.0
6 4 2015-07-11 15344 1414 14669.5
7 5 2015-07-23 8492 833 8492.0
8 2 2015-07-19 8565 687 6064.0
9 10 2015-07-09 7185 681 7185.0