I have a dataset of Chatrooms with people sending messages and a bigger dataset containing both these people and everyone else who did something besides go to a Chatroom.
I need to identify the people who were alone in the Chatrooms (in the chatroom dataset)—i.e., anyone in a chatroom with messages from only one person—and remove them from the bigger dataset.
So from the dataset below, I'd need to identify Persons A, J, and M and then remove those Persons from the bigger dataset (which also contains Persons A through Z and lots of other non-chat data).
data.table(Chatroom = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8), Person = c("A","A", "B","C","D","E","F","G","H","I","J","J","J","K","L","M", "M"), Message = c("Hi", "You there?", "Hello", "Hi", "Hey", "Howdy", "Hi", "Hey", "Greetings", "Hi", "Hi", "Hello?", "Anyone there?", "Hey", "Hi", "Hello?", "Helllooooooo?"))
Chatroom Person Message
1: 1 A Hi
2: 1 A You there?
3: 2 B Hello
4: 2 C Hi
5: 3 D Hey
6: 3 E Howdy
7: 4 F Hi
8: 4 G Hey
9: 5 H Greetings
10: 5 I Hi
11: 6 J Hi
12: 6 J Hello?
13: 6 J Anyone there?
14: 7 K Hey
15: 7 L Hi
16: 8 M Hello?
17: 8 M Helllooooooo?
What R functions would help me do this?
I know how to create a list manually, but not from existing data based on criteria like what I mention above. I have used functions like mutate, glue, arrange, and drop_na, but I am struggling to imagine how to put these together to reproducibly filter the bigger dataset using the Chatroom subset.
Advice?
You can create a small datatable to_remove
like this:
to_remove = df[,ct:=uniqueN(Person), Chatroom][ct==1, .(Person=unique(Person))]
And then anti-join on your other data.table (say df_big
), like this:
df_big[!to_remove, on="Person"]
I'm using data.table, since that is how you've provided df
, but if you want an approach using dplyr
, you can do this:
anti_join(
df_big,
df %>% group_by(Chatroom) %>% distinct(Person) %>% filter(n()==1),
by = "Person"
)