python pandas dataframe group-by

Return a subset of a DataFrame based on a condition resulting from groupby


I have a data frame like below:

date|point|agent
2023-10-02|A|agent1 
2023-10-02|A|agent2 
2023-10-05|B|agent3 
2023-10-05|B|agent2 
2023-10-02|C|agent1 
2023-10-02|C|agent2 
2023-10-02|C|agent3 

On each day at a specific point, there should be only two agents. In some cases there are more than two, and I want to return the rows for those (date, point) combinations that have more than 2 agents.

I used groupby to first count:

df.groupby(['point','date'])['agent'].nunique()>2 

I can use

df['agent_count'] = df.groupby(['point','date'])['agent'].transform('nunique')

and then get the rows that have more than 2. But is there another way, without adding redundant data? I tried loc, iloc, and where, but each gave me lots of errors. I am looking for an efficient way to return the rows without adding the counts to the dataframe. I searched the questions here for two hours, but none of the answers worked for me.


Solution

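  • You're on the right track. transform('nunique') returns a Series aligned with the original rows, so you can use the comparison directly as a boolean mask without storing the counts in a column. Here's how you can do it in one line.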

    df[df.groupby(['point','date'])['agent'].transform('nunique')>2]
    
    
             date point   agent
    4  2023-10-02     C  agent1
    5  2023-10-02     C  agent2
    6  2023-10-02     C  agent3
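
For anyone who wants to reproduce this, here is a minimal, self-contained sketch. The DataFrame is rebuilt by hand from the sample in the question, and the variable names (mask, via_transform, via_filter) are only illustrative. It also shows groupby(...).filter(...) as one possible alternative; that is not part of the answer above, and on large frames it may be slower than the transform mask because it calls a Python function once per group.

```python
import pandas as pd

# Sample data rebuilt from the question (dates kept as plain strings).
df = pd.DataFrame({
    "date":  ["2023-10-02", "2023-10-02", "2023-10-05", "2023-10-05",
              "2023-10-02", "2023-10-02", "2023-10-02"],
    "point": ["A", "A", "B", "B", "C", "C", "C"],
    "agent": ["agent1", "agent2", "agent3", "agent2",
              "agent1", "agent2", "agent3"],
})

# transform('nunique') gives one value per original row (same index as df),
# so the comparison can be used directly as a boolean mask.
mask = df.groupby(["point", "date"])["agent"].transform("nunique") > 2
via_transform = df[mask]

# Alternative sketch: filter keeps every row of each group for which the
# function returns True, and drops the other groups entirely.
via_filter = df.groupby(["point", "date"]).filter(
    lambda g: g["agent"].nunique() > 2
)

print(via_transform)
```

Neither approach adds a column to df; the counts exist only in the temporary mask (or inside the per-group function).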