r nlp data-manipulation analysis

Data manipulation: Select users based on variables


I am currently working on a machine learning project. I have a large dataset that was scraped off the forum www.stormfront.com. The dataset has 7 columns: stormfront_self_content (forum posts), stormfront_lang_id, stormfront_publication_date, stormfront_topic, stormfront_docid, stormfront_category, stormfront_user.

I want to select a set of users that have been registered on the forum for more than one year and that have written more than 500 posts, but I am not sure how to do that.

Any help would be greatly appreciated.

Dataset example


Solution

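  • Assuming you have an id column that identifies each user (stormfront_user looks like the natural candidate in your data), you can group_by that column and keep only the groups that have more than 500 rows and whose earliest and latest publication dates are more than 365 days apart. Since the dataset has no registration date, the span between a user's first and last post serves as a proxy for how long they have been registered.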

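    library(dplyr)
    library(lubridate)
    
    df %>%
      # parse the timestamps (assumes a year-month-day hour:minute:second format)
      mutate(stormfront_publication_date = ymd_hms(stormfront_publication_date)) %>%
      # one group per user; replace id with your user column
      group_by(id) %>%
      # keep users with more than 500 posts spanning more than 365 days
      filter(n() > 500,
             difftime(max(stormfront_publication_date),
                      min(stormfront_publication_date), units = 'days') > 365) %>%
      ungroup()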
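
If you want the list of qualifying users itself rather than all of their posts, a per-user summary may be easier to inspect. The following is a minimal sketch on made-up toy data: the user names and dates are hypothetical, and it assumes stormfront_user is the user column and that the dates are already parsed as date-times.

```r
library(dplyr)
library(lubridate)

# Toy data (hypothetical): three users with different activity patterns.
start <- ymd_hms("2015-01-01 00:00:00")
day <- 86400  # seconds in a day; adding a number to a POSIXct adds seconds

df <- bind_rows(
  # 501 posts spread over 730 days: meets both conditions
  data.frame(stormfront_user = "user_a",
             stormfront_publication_date = start + seq(0, 730 * day, length.out = 501)),
  # 501 posts within 10 days: enough posts, span too short
  data.frame(stormfront_user = "user_b",
             stormfront_publication_date = start + seq(0, 10 * day, length.out = 501)),
  # 3 posts over 730 days: long span, too few posts
  data.frame(stormfront_user = "user_c",
             stormfront_publication_date = start + c(0, 365, 730) * day)
)

# One row per user with the post count and the first-to-last-post span in days.
active_users <- df %>%
  group_by(stormfront_user) %>%
  summarise(
    n_posts = n(),
    span_days = as.numeric(difftime(max(stormfront_publication_date),
                                    min(stormfront_publication_date),
                                    units = "days")),
    .groups = "drop"
  ) %>%
  filter(n_posts > 500, span_days > 365)

# Keep only the posts written by the selected users.
selected_posts <- df %>%
  semi_join(active_users, by = "stormfront_user")
```

Here active_users holds one row per selected user, which makes it easy to check the counts and spans before filtering, and semi_join then keeps the matching posts without adding any columns to df.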