Data manipulation: Select users based on variables

I am currently working on a machine learning project. I have a large dataset that was scraped from a forum. The dataset has 7 columns: stormfront_self_content (forum posts), stormfront_lang_id, stormfront_publication_date, stormfront_topic, stormfront_docid, stormfront_category, stormfront_user.

I want to select the users who have been registered on the forum for more than one year and who have written more than 500 posts, but I am not sure how to do that.

Any help would be greatly appreciated.

Dataset example


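  • Assuming you have an id column that identifies each user (in the question's data, stormfront_user appears to play that role), you can group_by that column and keep only the groups that have more than 500 rows and whose earliest and latest publication dates are more than 365 days apart. Since the dataset has no registration date, the span between a user's first and last post serves as a proxy for how long they have been registered.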

    library(dplyr)
    library(lubridate)

    df %>%
      mutate(stormfront_publication_date = ymd_hms(stormfront_publication_date)) %>%
      group_by(id) %>%
      # keep users with more than 500 posts spanning more than 365 days
      filter(n() > 500 & difftime(max(stormfront_publication_date),
                                  min(stormfront_publication_date), units = "days") > 365)
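
To check the logic on something small, here is a minimal sketch on invented toy data. It assumes stormfront_user is the column that identifies a user and that the time between a user's first and last post is an acceptable stand-in for "registered for more than one year". The names posts, user_stats, selected_users and selected_posts are made up for this example, and the three toy users are not taken from the real dataset.

```r
library(dplyr)
library(lubridate)

# Toy data (invented for illustration): three users with different activity patterns.
start <- ymd_hms("2015-01-01 00:00:00")
posts <- bind_rows(
  # user "a": 600 posts, one per day
  tibble(stormfront_user = "a",
         stormfront_publication_date = start + days(seq(0, 599))),
  # user "b": 600 posts, one per hour (about 25 days in total)
  tibble(stormfront_user = "b",
         stormfront_publication_date = start + hours(seq(0, 599))),
  # user "c": 10 posts spread over 900 days
  tibble(stormfront_user = "c",
         stormfront_publication_date = start + days(seq(0, 900, by = 100)))
)

# Per-user summary: number of posts and days between first and last post.
user_stats <- posts %>%
  group_by(stormfront_user) %>%
  summarise(
    n_posts = n(),
    active_days = as.numeric(difftime(max(stormfront_publication_date),
                                      min(stormfront_publication_date),
                                      units = "days")),
    .groups = "drop"
  )

# Users that satisfy both conditions.
selected_users <- user_stats %>%
  filter(n_posts > 500, active_days > 365) %>%
  pull(stormfront_user)

# Keep only the posts written by the selected users.
selected_posts <- posts %>%
  filter(stormfront_user %in% selected_users)
```

Computing user_stats as a separate step makes it easy to inspect the counts and time spans before filtering, and to adjust the two thresholds if they turn out to be too strict for the real data.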