r function data-cleaning

How to eliminate all rows in a set of rows containing only one entity (in R)


I have a dataset of chat messages from chatrooms. I need to filter out all chatrooms in which only one person wrote something in the chat (even if that person wrote multiple things). So in the example dataset below, I need to eliminate Chatrooms 1, 6, and 8.

library(data.table)

df <- data.table(
  Chatroom = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8),
  Person = c("A", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "J", "J", "K", "L", "M", "M"),
  Message = c("Hi", "You there?", "Hello", "Hi", "Hey", "Howdy", "Hi", "Hey", "Greetings", "Hi", "Hi", "Hello?", "Anyone there?", "Hey", "Hi", "Hello?", "Helllooooooo?")
)

    Chatroom Person       Message
 1:        1      A            Hi
 2:        1      A    You there?
 3:        2      B         Hello
 4:        2      C            Hi
 5:        3      D           Hey
 6:        3      E         Howdy
 7:        4      F            Hi
 8:        4      G           Hey
 9:        5      H     Greetings
10:        5      I            Hi
11:        6      J            Hi
12:        6      J        Hello?
13:        6      J Anyone there?
14:        7      K           Hey
15:        7      L            Hi
16:        8      M        Hello?
17:        8      M Helllooooooo?

Obviously, this can be done manually, but I have tons of data to filter.

Is there a way to do this with one or more scripts in R?

I imagine I need one script that identifies and saves the list of chatrooms containing only one person, and then another script that removes the chatrooms on that list from the data, but I don't know which functions can accomplish this.

Help?


Solution

  • There are a number of options. My first try was to subset .SD with uniqueN(Person)>1, grouped by Chatroom:

    df[, .SD[uniqueN(Person)>1], Chatroom]
    

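    Some possibly slightly faster options. Note that := adds the helper column ct to df by reference, so df itself keeps that column afterwards; only the filtered result has it removed. Assign the result to a name to keep it: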

    df[, ct:=uniqueN(Person), Chatroom][ct>1][,ct:=NULL]
    

    OR

    df[, ct:=length(unique(Person)), Chatroom][ct>1][,ct:=NULL]
    

    OR

    df[, ct:=max(rleid(Person)), Chatroom][ct>1][,ct:=NULL]
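
The two-step approach imagined in the question (first save the list of single-person chatrooms, then remove them) can also be sketched directly. This is only a minimal illustration, assuming the example data from the question; the names solo_rooms, n_people, and filtered are made up for this sketch and are not part of data.table. Unlike the := variants, it leaves df unmodified.

```r
library(data.table)

# Example data from the question
df <- data.table(
  Chatroom = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8),
  Person = c("A", "A", "B", "C", "D", "E", "F", "G", "H", "I",
             "J", "J", "J", "K", "L", "M", "M"),
  Message = c("Hi", "You there?", "Hello", "Hi", "Hey", "Howdy", "Hi", "Hey",
              "Greetings", "Hi", "Hi", "Hello?", "Anyone there?", "Hey", "Hi",
              "Hello?", "Helllooooooo?")
)

# Step 1: count distinct people per chatroom and save the IDs of the
# chatrooms in which only one person wrote (hypothetical helper names).
solo_rooms <- df[, .(n_people = uniqueN(Person)), by = Chatroom][n_people == 1, Chatroom]

# Step 2: drop every row whose chatroom is on that list.
# df itself is not changed; the result is a new data.table.
filtered <- df[!Chatroom %in% solo_rooms]

print(solo_rooms)
print(filtered)
```

Keeping solo_rooms as a separate object may be handy if the removed chatrooms need to be inspected or logged before they are dropped.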