Search code examples
rfiltergroup-byaggregatedistinct-values

Filter multiple occurrences based on group


I have a dataset like mentioned below:

df=data.frame(Supplier_id=c("1","2","7","7","7","4","5","8","12","7"), Supplier=c("Tian","Yan","Goldy","Goldy","Goldy","Amy","Lauren","Cassy","Shaan","Goldy"),Date=c("1/17/2019","4/30/2019","11/29/2018","11/29/2018","11/29/2018","5/21/2018","5/23/2018","5/24/2018","6/15/2018","6/20/2018"),Buyer=c("Unclassified","Unclassified","Kelly","Kelly","Kelly","Kelly","Amanda","Echo","Shao","Shao"))

df$Supplier_id=as.numeric(as.character(df$Supplier_id))

Thus, df appears like below:


| Supplier_id | Supplier | Date       | Buyer        |
|-------------|----------|------------|--------------|
| 1           | Tian     | 1/17/2019  | Unclassified |
| 2           | Yan      | 4/30/2019  | Unclassified |
| 7           | Goldy    | 11/29/2018 | Kelly        |
| 7           | Goldy    | 11/29/2018 | Kelly        |
| 7           | Goldy    | 11/29/2018 | Kelly        |
| 4           | Amy      | 5/21/2018  | Kelly        |
| 5           | Lauren   | 5/23/2018  | Amanda       |
| 8           | Cassy    | 5/24/2018  | Echo         |
| 12          | Shaan    | 6/15/2018  | Shao         |
| 7           | Goldy    | 6/20/2018  | Shao         |

Now, I want to filter out the Supplier_id's that occur only once for each unique Buyer. For example, in the above dataset, Supplier_id '1' and '2' belong to 'unclassified' buyer, but because they have different ids, I do not want them in my final output. However, when we look at the buyer 'Kelly', it has two supplier_ids, '7' and '4', where, '7' is occurring 3 times and '4' only once. So, the output table should have the record with supplier_id='7'. The grouping should be based on 'Buyer'. So it is important to note that since the supplier_id '7' exists for both 'Kelly' and 'Shao', but it should be grouped differently for both these buyers and not considered together.

The expected output should be:

| Supplier_id | Supplier |       Date | Buyer_id |
|-------------|:--------:|-----------:|----------|
| 7           |   Goldy  | 11/29/2018 | Kelly    |
| 7           |   Goldy  | 11/29/2018 | Kelly    |
| 7           |   Goldy  | 11/29/2018 | Kelly    |

I have tried using group_by and filter but this would not work because there will be distinct supplier_id's for every buyer.I have also tried using duplicate but not sure how can I group the supplier_id for each buyer.

df <-df %>% group_by(Buyer) %>% filter(Supplier_id>1)

and also this

df2=df[duplicated(df[1]) | duplicated(df[1], fromLast=TRUE),]

EDIT: The original dataset has many such instances and there are n occurrences of different supplier_id for each buyer. What could be other way to get the desired output?


Solution

  • I think you need -

    df %>% group_by(Supplier_id, Buyer) %>% filter(n() > 1)