Search code examples
pythonjoinpandasdataframegraphlab

Merging median from one column by a key column - SFrame / Pandas


In graphlab, I have the following SFrame call train:

import graphlab
train = graphlab.read_csv('clean_train.csv')
train.head()

[out]:

+-------+------------+---------+-----------+
| Store |    Date    |  Sales  | Customers |
+-------+------------+---------+-----------+
|   1   | 2015-07-31 |  5263.0 |   555.0   |
|   2   | 2015-07-31 |  6064.0 |   625.0   |
|   3   | 2015-07-31 |  8314.0 |   821.0   |
|   4   | 2015-07-31 | 13995.0 |   1498.0  |
|   3   | 2015-07-20 |  4822.0 |   559.0   |
|   2   | 2015-07-10 |  5651.0 |   589.0   |
|   4   | 2015-07-11 | 15344.0 |   1414.0  |
|   5   | 2015-07-23 |  8492.0 |   833.0   |
|   2   | 2015-07-19 |  8565.0 |   687.0   |
|   10  | 2015-07-09 |  7185.0 |   681.0   |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]

To get the median Sales per store, I can do the following to attach a new column for the median sales per store using graphlab:

mediansales_perstore = train.groupby('Store', operations={'mediansales': agg.QUANTILE('Sales', 0.5)})
train_stores = train_stores.join(mediansales_perstore, on='Store')
test_stores['mediansales'] = [i[0] for i in test_stores['mediansales']]

The code works in graphlab that addeds a new row mediansales. But when I try to use pandas DataFrame with the code:

mediansales_perstore = train.groupby(['Store'])['Sales'].median()

This extracts the median sales per store as per the graphlab code, but when i try to merge it back to the train matrix with:

train.join(pd.DataFrame(train.groupby(['Store'])['Sales'].median()), on='Store')

It fails and throws the error:

ValueError                                Traceback (most recent call last)
<ipython-input-15-7b64cb46e386> in <module>()
----> 1 train.join(pd.DataFrame(train.groupby(['Store'])['Sales'].median()), on='Store')

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in join(self, other, on, how, lsuffix, rsuffix, sort)
   4017         # For SparseDataFrame's benefit
   4018         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 4019                                  rsuffix=rsuffix, sort=sort)
   4020 
   4021     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   4031             return merge(self, other, left_on=on, how=how,
   4032                          left_index=on is None, right_index=True,
-> 4033                          suffixes=(lsuffix, rsuffix), sort=sort)
   4034         else:
   4035             if on is not None:

/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
     36                          right_index=right_index, sort=sort, suffixes=suffixes,
     37                          copy=copy)
---> 38     return op.get_result()
     39 if __debug__:
     40     merge.__doc__ = _merge_doc % '\nleft : DataFrame'

/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.pyc in get_result(self)
    190 
    191         llabels, rlabels = items_overlap_with_suffix(ldata.items, lsuf,
--> 192                                                      rdata.items, rsuf)
    193 
    194         lindexers = {1: left_indexer} if left_indexer is not None else {}

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in items_overlap_with_suffix(left, lsuffix, right, rsuffix)
   3969         if not lsuffix and not rsuffix:
   3970             raise ValueError('columns overlap but no suffix specified: %s' %
-> 3971                              to_rename)
   3972 
   3973         def lrenamer(x):

ValueError: columns overlap but no suffix specified: Index([u'Sales'], dtype='object')

How could I merge the median of the "Sales" column using "Store" as the key using pandas? The graphlab code works though.


Solution

  • You may do this in one-stage using transform:

    >>> train['Median-Sales'] = train.groupby('Store')['Sales'].transform('median')
    >>> train
       Store        Date  Sales  Customers  Median-Sales
    0      1  2015-07-31   5263        555        5263.0
    1      2  2015-07-31   6064        625        6064.0
    2      3  2015-07-31   8314        821        6568.0
    3      4  2015-07-31  13995       1498       14669.5
    4      3  2015-07-20   4822        559        6568.0
    5      2  2015-07-10   5651        589        6064.0
    6      4  2015-07-11  15344       1414       14669.5
    7      5  2015-07-23   8492        833        8492.0
    8      2  2015-07-19   8565        687        6064.0
    9     10  2015-07-09   7185        681        7185.0
    

    the merge error simply says that you have duplicated column names across left and right frame, so either you need to provide suffixes to distinguish columns or rename the columns:

    >>> right = train.groupby('Store')['Sales'].median()
    >>> right.name = 'Median-Sales'
    >>> train.join(right, on='Store')
       Store        Date  Sales  Customers  Median-Sales
    0      1  2015-07-31   5263        555        5263.0
    1      2  2015-07-31   6064        625        6064.0
    2      3  2015-07-31   8314        821        6568.0
    3      4  2015-07-31  13995       1498       14669.5
    4      3  2015-07-20   4822        559        6568.0
    5      2  2015-07-10   5651        589        6064.0
    6      4  2015-07-11  15344       1414       14669.5
    7      5  2015-07-23   8492        833        8492.0
    8      2  2015-07-19   8565        687        6064.0
    9     10  2015-07-09   7185        681        7185.0