Search code examples
pythonpandasdataframefor-loopindex-error

How to extract the rows of a dataframe where a combination of specified column values are duplicated?


Say I have the following dataframe:

import pandas as pd
data = {'Year':[2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018],
        'Month':[1,1,1,2,2,3,3,3],
        'ID':['A', 'A', 'B', 'A', 'B', 'A', 'B', 'B'],
        'Fruit':['Apple', 'Banana', 'Apple', 'Pear', 'Mango', 'Banana', 'Apple', 'Mango']}
df = pd.DataFrame(data, columns=['Year', 'Month', 'ID', 'Fruit'])
df = df.astype(str)
df

enter image description here

I want to extract the combination of 'Year', 'Month' and 'ID' that are repeated. So, with the above dataframe, the expected result is this dataframe:

enter image description here

My approach to do this is to first do a groupby to calculate the number of times the combination of Year, Month and ID appear:

df2 = df.groupby(['Year', 'Month'])['ID'].value_counts().to_frame(name = 'Count').reset_index()
df2 = df2[df2.Count>1]
df2

enter image description here

And then, my idea was to iterate through the Year, Month and ID combinations in the groupby dataframe, and extract those rows that match the combinations in the original dataframe into a new dataframe:

df_new = pd.DataFrame(columns=df.columns, index=range(sum(df2.Count)))

count = 0
for i in df2.index:
    temp = df[(df.ID==df2.ID[i]) & (df.Year==df2.Year[i]) & (df.Month==df2.Month[i])]
    temp.reset_index(drop=True, inplace=True)
    for j in range(len(temp)):
        df_new.iloc[count] = temp.iloc[j]
        count+=1
df_new

But this gives the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-38-7f2d95d71270> in <module>()
      6     temp.reset_index(drop=True, inplace=True)
      7     for j in range(len(temp)):
----> 8         df_new.iloc[count] = temp.iloc[j]
      9         count+=1
     10 df_new

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\pandas\core\indexing.py in __setitem__(self, key, value)
    187         else:
    188             key = com.apply_if_callable(key, self.obj)
--> 189         indexer = self._get_setitem_indexer(key)
    190         self._setitem_with_indexer(indexer, value)
    191 

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\pandas\core\indexing.py in _get_setitem_indexer(self, key)
    173 
    174         try:
--> 175             return self._convert_to_indexer(key, is_setter=True)
    176         except TypeError as e:
    177 

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\pandas\core\indexing.py in _convert_to_indexer(self, obj, axis, is_setter)
   2245 
   2246         try:
-> 2247             self._validate_key(obj, axis)
   2248             return obj
   2249         except ValueError:

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
   2068             return
   2069         elif is_integer(key):
-> 2070             self._validate_integer(key, axis)
   2071         elif isinstance(key, tuple):
   2072             # a tuple should already have been caught by this point

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\pandas\core\indexing.py in _validate_integer(self, key, axis)
   2137         len_axis = len(self.obj._get_axis(axis))
   2138         if key >= len_axis or key < -len_axis:
-> 2139             raise IndexError("single positional indexer is out-of-bounds")
   2140 
   2141     def _getitem_tuple(self, tup):

IndexError: single positional indexer is out-of-bounds

What's the error? I am not able to figure out.

The error goes away when I change the contents of the for loop to the following, which produces the desired result:

for j in range(len(temp)):
    df_new.ID[count] = temp.ID[j]
    df_new.Year[count] = temp.Year[j]
    df_new.Month[count] = temp.Month[j]
    df_new.Fruit[count] = temp.Fruit[j]
    count+=1

But this is a tedious workaround that involves writing n lines for each of the n columns in the original dataframe.


Solution

  • Use GroupBy.transform with any column and counts by GroupBy.size for Series with same size like original, so possible filter by boolean indexing:

    df1 = df[df.groupby(['Year','Month','ID'])['ID'].transform('size') > 1]
    

    Or if small DataFrame or performance is not important use DataFrameGroupBy.filter:

    df1 = df.groupby(['Year','Month','ID']).filter(lambda x: len(x) > 1)
    

    print (df1)
    
       Year  Month ID   Fruit
    0  2018      1  A   Apple
    1  2018      1  A  Banana
    6  2018      3  B   Apple
    7  2018      3  B   Mango