Search code examples
pythonpandaspandas-groupbypandas-apply

Unexplained behavior with Pandas Split (group) + Apply + Rejoin (concat), but only when sorting


I'm trying to calculate distances between a column and its lag (shift) for groups in a Pandas dataframe. The groups need to be sorted so that the shift is one timestep before. The standard way to do this is by .groupby() (aka Split), then .apply() with the distance function over each group, then rejoin with .concat(). This works fine, but only when I don't explicitly sort the grouped dataframe. when I sort the grouped dataframe, I get an error in the rejoining step.

Here's my example code, for which I was able to reproduce the unexpected behavior:

import pandas as pd

def dist_apply(group):

    # when commented out, this code will run to completion (!)
    group.sort_values(by='T',inplace=True)

    group['shift'] = group['Y'].shift()
    group['dist'] = group['Y'] - group['shift']
    return group

df = pd.DataFrame(pd.DataFrame({'X': ['A', 'B', 'A', 'B', 'A', 'B'], 'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7], 'Y': [7, 1, 8, 3, 9, 5]}))
print(df)

# split
df_g = df.groupby(['X'])
# apply
df_g = df_g.apply(dist_apply)
print(df_g)

# rejoin
df = pd.concat([df,df_g],axis=1)
print(df)

When the code that sorts the grouped dataframe is commented out, then the code prints this, which is expected:

   X    T  Y
0  A  0.9  7
1  B  0.8  1
2  A  0.7  8
3  B  0.9  3
4  A  0.8  9
5  B  0.7  5

   X    T  Y  shift  dist
0  A  0.9  7    NaN   NaN
1  B  0.8  1    NaN   NaN
2  A  0.7  8    7.0   1.0
3  B  0.9  3    1.0   2.0
4  A  0.8  9    8.0   1.0
5  B  0.7  5    3.0   2.0

   X    T  Y  X    T  Y  shift  dist
0  A  0.9  7  A  0.9  7    NaN   NaN
1  B  0.8  1  B  0.8  1    NaN   NaN
2  A  0.7  8  A  0.7  8    7.0   1.0
3  B  0.9  3  B  0.9  3    1.0   2.0
4  A  0.8  9  A  0.8  9    8.0   1.0
5  B  0.7  5  B  0.7  5    3.0   2.0

With the sorting line, the Traceback looks like this:

Traceback (most recent call last):
  File "test.py", line 19, in <module>
    df = pd.concat([df,df_g],axis=1)
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 229, in concat
    return op.get_result()
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 420, in get_result
    indexers[ax] = obj_labels.reindex(new_labels)[1]
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2236, in reindex
    target = MultiIndex.from_tuples(target)
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 396, in from_tuples
    arrays = list(lib.tuples_to_object_array(tuples).T)
  File "pandas/_libs/lib.pyx", line 2287, in pandas._libs.lib.tuples_to_object_array
TypeError: object of type 'int' has no len()

Sorting but not running the concat prints me this for df_g:

     X    T  Y  shift  dist
X                          
A 2  A  0.7  8    NaN   NaN
  4  A  0.8  9    8.0   1.0
  0  A  0.9  7    9.0  -2.0
B 5  B  0.7  5    NaN   NaN
  1  B  0.8  1    5.0  -4.0
  3  B  0.9  3    1.0   2.0

which shows that it's grouped differently than the printing of df_g without the sorting (above), but it's not clear how the concat is breaking in this case.


update: I thought I had solved it by renaming the offending column ('X' in this case) and also using .reset_index() on the grouped dataframe before the merge.

df_g.columns = ['X_g','T','Y','shift','dist']
df = pd.concat([df,df_g.reset_index()],axis=1)

runs as expected, and prints this:

   X    T  Y  X  level_1 X_g    T  Y  shift  dist
0  A  0.9  7  A        2   A  0.7  8    NaN   NaN
1  B  0.8  1  A        4   A  0.8  9    8.0   1.0
2  A  0.7  8  A        0   A  0.9  7    9.0  -2.0
3  B  0.9  3  B        5   B  0.7  5    NaN   NaN
4  A  0.8  9  B        1   B  0.8  1    5.0  -4.0
5  B  0.7  5  B        3   B  0.9  3    1.0   2.0

But looking closely, this column shows that the merge is done incorrectly:

    1  B  0.8  1  A        4   A  0.8  9    8.0   1.0

I'm using Mac OSX with Python 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:05:27)

Pandas 0.24.2 + Numpy 1.17.3 and also tried upgrading to Pandas 0.25.3 and Numpy 1.17.5 with the same result.


Solution

  • This is tentatively working.

    Rename columns to avoid duplicate:

    df_g.columns = ['X_g','T','Y','shift','dist']
    

    Reset index to single from multiindex:

    df_g = df_g.reset_index(level=[0,1])
    

    Simple merge, put df_g first if you want to keep the sorted-group order:

    df = pd.merge(df_g,df)
    

    gives me

       X  level_1 X_g    T  Y  shift  dist
    0  A        2   A  0.7  8    NaN   NaN
    1  A        4   A  0.8  9    8.0   1.0
    2  A        0   A  0.9  7    9.0  -2.0
    3  B        5   B  0.7  5    NaN   NaN
    4  B        1   B  0.8  1    5.0  -4.0
    5  B        3   B  0.9  3    1.0   2.0
    

    Full code:

    import pandas as pd
    
    def dist_apply(group):
    
        group.sort_values(by='T',inplace=True)
    
        group['shift'] = group['Y'].shift()
        group['dist'] = group['Y'] - group['shift']
        return group
    
    df = pd.DataFrame(pd.DataFrame({'X': ['A', 'B', 'A', 'B', 'A', 'B'], 'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7], 'Y': [7, 1, 8, 3, 9, 5]}))
    print(df)
    df_g = df.groupby(['X'])
    
    df_g = df_g.apply(dist_apply)
    
    #print(df_g)
    
    df_g.columns = ['X_g','T','Y','shift','dist']
    df_g = df_g.reset_index(level=[0,1])
    
    #print(df_g)
    df = pd.merge(df_g,df)
    
    print(df)