r, twitter, ggplot2, data-visualization, normalization

Plotting normalized subset of data


I need to draw a line plot where x = hour of day and y = the (normalized) number of tweets in that hour, counting only tweets from a given month. Each line represents one month.

My data frame is in this format (I've got more columns, but they're not relevant here):

id_tweet           day month hour minute id_user
550654742654103552  01   01   12    08   174744462
550654753106296832  01   01   12    08   15355832 
550654818935910400  01   01   12    08   628822209
550654823667089409  01   01   12    08   283218297
550654824308813824  01   01   12    09   58315346

I want to know what percentage of each month's tweets (January, July, and so on) is posted at each hour of the day.

The problem is that my data is very biased: the collection algorithm changed, so I've got a lot more data for months 1 to 4 than for the rest. My data distribution is shown in the image below:

Long story short, I need to count the tweets posted at each hour of the day in January and divide each count by the total number of tweets from January. That would be line 1 of the graph.

Line 2 would be the tweets posted at each hour of the day in February, divided by the total number of tweets from February, and so on.

I hope that was clear. Thanks in advance for any help.


Solution

  • You can use dplyr to aggregate your data:

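    library(dplyr)

    agg_data <- your_data %>%
      # one row per month and hour of day, holding the tweet count
      group_by(month, hour) %>%
      summarize(n_hour = n()) %>%
      # normalize within each month: a fraction between 0 and 1
      group_by(month) %>%
      mutate(percent_of_month = n_hour / sum(n_hour)) %>%
      ungroup()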
    

    I'll leave the plotting to you.
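
For completeness, here is a minimal sketch of how the plotting step might look with ggplot2. It is not part of the original answer. The toy `your_data` below is made up purely for illustration and only mimics the `month` and `hour` columns from the question. The sketch aggregates by month and hour only, so each month contributes one point per hour, and it converts `hour` to an integer so the x axis is ordered numerically.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with the same column layout as in the question
your_data <- data.frame(
  month = c("01", "01", "01", "01", "02", "02"),
  hour  = c("12", "12", "13", "14", "12", "13")
)

# Count tweets per month and hour, then divide by each month's total
agg_data <- your_data %>%
  count(month, hour, name = "n_hour") %>%
  group_by(month) %>%
  mutate(percent_of_month = n_hour / sum(n_hour)) %>%
  ungroup()

# One line per month: x = hour of day, y = share of that month's tweets
p <- ggplot(agg_data,
            aes(x = as.integer(hour), y = percent_of_month,
                colour = month, group = month)) +
  geom_line() +
  labs(x = "Hour of day",
       y = "Fraction of the month's tweets",
       colour = "Month")

print(p)
```

Because every month's values sum to 1, months with far more collected tweets no longer dominate the plot, which is the point of the normalization asked for in the question.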