Search code examples
rfunctionreferencedata-cleaning

Delete rows containing certain IDs in one dataset by using IDs identified in another dataset (in R)


I have a dataset of Chatrooms with people sending messages and a bigger dataset containing both these people and everyone else who did something besides go to a Chatroom.

I need to identify the people who were alone in the Chatrooms (in the chatroom dataset)—i.e., anyone in a chatroom with messages from only one person—and remove them from the bigger dataset.

So from the dataset below, I'd need to identify Persons A, J, and M and then remove those Persons from the bigger dataset (which also contains Persons A through Z and lots of other non-chat data).

data.table(Chatroom = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8), Person = c("A","A", "B","C","D","E","F","G","H","I","J","J","J","K","L","M", "M"), Message = c("Hi", "You there?", "Hello", "Hi", "Hey", "Howdy", "Hi", "Hey", "Greetings", "Hi", "Hi", "Hello?", "Anyone there?", "Hey", "Hi", "Hello?", "Helllooooooo?"))

    Chatroom Person       Message
 1:        1      A            Hi
 2:        1      A    You there?
 3:        2      B         Hello
 4:        2      C            Hi
 5:        3      D           Hey
 6:        3      E         Howdy
 7:        4      F            Hi
 8:        4      G           Hey
 9:        5      H     Greetings
10:        5      I            Hi
11:        6      J            Hi
12:        6      J        Hello?
13:        6      J Anyone there?
14:        7      K           Hey
15:        7      L            Hi
16:        8      M        Hello?
17:        8      M Helllooooooo?

What R functions would help me do this?

I know how to create a list manually, but not from existing data based on criteria like what I mention above. I have used functions like mutate, glue, arrange, and drop_na, but I am struggling to imagine how to put these together to reproducibly filter the bigger dataset using the Chatroom subset.

Advice?


Solution

  • You can create a small datatable to_remove like this:

    to_remove = df[,ct:=uniqueN(Person), Chatroom][ct==1, .(Person=unique(Person))]
    

    And then anti-join on your other data.table (say df_big), like this:

    df_big[!to_remove, on="Person"]
    

    I'm using data.table, since that is how you've provided df, but if you want an approach using dplyr, you can do this:

    anti_join(
      df_big,
      df %>% group_by(Chatroom) %>% distinct(Person) %>% filter(n()==1),
      by = "Person"
    )