r, unique, data-cleaning

R remove duplicates completely across different groups


I have a dataset like the following:

[Image: table of the example dataset]

R code to replicate the dataset:

mydata <- data.frame(
  Name = c('Alex', 'Brenda', 'Carl', 'Alex', 'Daniel', 'Einstein', 'Frodo',
           'Alex', 'Brenda', 'Carl', 'Einstein', 'Frodo'),
  Product_Name = c('A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'),
  Use = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1)
)
mydata

This is a dataset from a survey of product usage. Name contains the name of the user, Product_Name is the name of the product (Product A or Product B here; the real dataset has more than two products), and Use records whether the user uses the product (1 = yes, 0 = no).

Unfortunately, some individuals answered both yes and no when asked whether they use a product. I want to remove these individuals, but only for the Product_Name in question. In the example, user Alex replied both yes and no for Product A:

[Image: the two Alex rows for Product A]

Here I want to remove Alex for Product A only and keep Alex for Product B. This is how I want the dataset to look:

[Image: the desired dataset, with Alex removed from Product A only]

I know that I can remove duplicates using the unique() function in base R (https://stat.ethz.ch/R-manual/R-devel/library/base/html/unique.html), but that would still leave one case of Alex in Product A. I would also like to limit the search for unique names to within each Product_Name (i.e. only within Product A, only within Product B, and so on). Any help will be appreciated.
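To illustrate the limitation described above, here is a minimal base R sketch. It assumes the duplicates are identified by the Name/Product_Name pair; the names `keys` and `first_only` are introduced only for this illustration.

```r
mydata <- data.frame(
  Name = c("Alex", "Brenda", "Carl", "Alex", "Daniel", "Einstein", "Frodo",
           "Alex", "Brenda", "Carl", "Einstein", "Frodo"),
  Product_Name = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Use = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1)
)

# unique() compares whole rows; the two Alex rows for Product A differ in Use,
# so neither of them is treated as a duplicate
whole_row <- unique(mydata)

# Deduplicating on the Name/Product_Name pair keeps the first copy of each pair
keys <- mydata[, c("Name", "Product_Name")]
first_only <- mydata[!duplicated(keys), ]

# One Alex row for Product A is still present
first_only[first_only$Name == "Alex" & first_only$Product_Name == "A", ]
```

Either way, at least one Alex row for Product A survives, which is why a per-group filter is needed.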

Please let me know if the question is not very clear. Thanks in advance.

FOLLOW UP QUESTION

Now suppose we have the following scenario:

mydata <- data.frame(
  Name = c('Alex', 'Brenda', 'Carl', 'Alex', 'Daniel', 'Einstein', 'Frodo',
           'Alex', 'Brenda', 'Carl', 'Einstein', 'Frodo',
           'Mary', 'Mary', 'Richard', 'Richard'),
  Product_Name = c('A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B',
                   'C', 'C', 'C', 'C'),
  Use = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1)
)

In addition to the condition above (a person who has both Use = 0 and Use = 1 for a product is deleted for that product), I have a further condition. If Use = 0 and there are multiple entries for the same user, the observations are kept. However, if Use = 1 and there are multiple entries for the same user, they are deleted. For instance, in the figure below, I would like to keep the observations for Mary and drop the observations for Richard.

[Image: the Product C rows for Mary and Richard]

The final output that I would like to get would look something like this:

[Image: the desired output, with Mary kept and Richard removed]

In this figure, note that I do not want to delete Mary, since Use = 0 for both of her entries. However, since Use = 1 for both of Richard's entries, I would like to delete his observations.



Solution

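  • Original Question

    library(dplyr)

    # Keep only Name/Product_Name pairs that occur exactly once; any repeated
    # pair is dropped, whether or not its Use values conflict
    mydata %>%
      group_by(Product_Name, Name) %>%
      filter(n() == 1) %>%
      ungroup()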
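
    For readers who prefer not to load dplyr, the same idea can be sketched in base R. This is an illustrative alternative, not part of the original answer; the names `keys`, `repeated` and `cleaned` are assumptions made for the example. It relies on `duplicated()` flagging only the second and later copies, so it scans from both ends to flag every row whose key is repeated.

```r
mydata <- data.frame(
  Name = c("Alex", "Brenda", "Carl", "Alex", "Daniel", "Einstein", "Frodo",
           "Alex", "Brenda", "Carl", "Einstein", "Frodo"),
  Product_Name = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Use = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1)
)

# A "group" is one user within one product
keys <- mydata[, c("Product_Name", "Name")]

# TRUE for every row whose key appears more than once (first copy included)
repeated <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# Drop all copies of repeated keys: Alex disappears from Product A only
cleaned <- mydata[!repeated, ]
cleaned
```

    Like the dplyr version, this treats any repeated Name/Product_Name pair as invalid, which matches the original question because the only repeated pair there has conflicting answers.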
    

    Follow-up Question

    library(dplyr)

    # Keep a Name/Product_Name pair if it occurs once,
    # or if every one of its rows has Use == 0
    mydata %>%
      group_by(Product_Name, Name) %>%
      filter(n() == 1 | all(Use == 0)) %>%
      ungroup()
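
    As a quick check of the follow-up rule, here is a base R sketch run on the follow-up data. It is an assumed equivalent of the dplyr filter, not the answerer's code; `key`, `group_size`, `all_zero` and `result` are names introduced for the example. `ave()` applies a function per group and returns a value for every row.

```r
mydata <- data.frame(
  Name = c("Alex", "Brenda", "Carl", "Alex", "Daniel", "Einstein", "Frodo",
           "Alex", "Brenda", "Carl", "Einstein", "Frodo",
           "Mary", "Mary", "Richard", "Richard"),
  Product_Name = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
                   "C", "C", "C", "C"),
  Use = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1)
)

# One grouping factor per user-within-product combination that actually occurs
key <- interaction(mydata$Product_Name, mydata$Name, drop = TRUE)

# Number of rows in each row's group
group_size <- ave(mydata$Use, key, FUN = length)

# 1 if every Use value in the row's group is 0, otherwise 0
all_zero <- ave(mydata$Use, key, FUN = function(u) all(u == 0)) == 1

# Keep single entries, plus repeated entries that are consistently Use = 0
result <- mydata[group_size == 1 | all_zero, ]
result
```

    On this data the two Alex rows for Product A and both Richard rows are dropped, while both Mary rows are kept.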