I am currently working on a machine learning project. I have a large dataset that was scraped off the forum www.stormfront.com. The dataset has 7 columns: stormfront_self_content (forum posts), stormfront_lang_id, stormfront_publication_date, stormfront_topic, stormfront_docid, stormfront_category, stormfront_user.
I want to select a set of users that have been registered on the forum for more than one year and that have written more than 500 posts, but I am not sure how to do that.
Any help would be greatly appreciated.
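Assuming you have some id column which represents each user (stormfront_user looks like the likely candidate in your data), we can group_by each id and keep only the groups that have more than 500 rows and whose max and min publication dates are more than 365 days apart. Since the dataset has no registration date, the span between a user's first and last post serves as a proxy for how long they have been on the forum.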
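library(dplyr)
library(lubridate)

df %>%
  # Parse timestamps; ymd_hms() assumes a "YYYY-MM-DD HH:MM:SS" layout
  mutate(stormfront_publication_date = ymd_hms(stormfront_publication_date)) %>%
  # Replace id with your user column (e.g. stormfront_user)
  group_by(id) %>%
  # Keep users with more than 500 posts spanning more than 365 days
  filter(n() > 500 &
           difftime(max(stormfront_publication_date),
                    min(stormfront_publication_date), units = "days") > 365)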
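
The filter above returns every post written by the qualifying users. If you only need the set of users, one option is to summarise per user first and then filter the summary. This is a minimal sketch, not a definitive solution: it assumes stormfront_user identifies a user and that the dates parse with ymd_hms(). The toy data frame and the names min_posts, min_days and active_users are made up for illustration, and the post threshold is lowered so the toy data can pass it.

```r
library(dplyr)
library(lubridate)

# Toy data standing in for the real scrape (values are invented)
df <- data.frame(
  stormfront_user = c("user_a", "user_a", "user_a", "user_b", "user_b"),
  stormfront_publication_date = c(
    "2015-01-01 10:00:00", "2015-06-01 10:00:00", "2016-03-01 10:00:00",
    "2016-01-01 10:00:00", "2016-02-01 10:00:00"
  ),
  stringsAsFactors = FALSE
)

min_posts <- 2    # hypothetical threshold for the toy data; use 500 on the real data
min_days  <- 365

active_users <- df %>%
  mutate(stormfront_publication_date = ymd_hms(stormfront_publication_date)) %>%
  group_by(stormfront_user) %>%
  # One row per user: post count and days between first and last post
  summarise(
    n_posts = n(),
    days_active = as.numeric(
      difftime(max(stormfront_publication_date),
               min(stormfront_publication_date), units = "days")
    )
  ) %>%
  filter(n_posts > min_posts, days_active > min_days)
```

On the toy data only user_a is kept, since user_b has too few posts and too short a posting span. On the real data, active_users$stormfront_user would give the user names, which you could pass to semi_join() or filter(stormfront_user %in% ...) to pull their posts back out.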