ggplot2, bar-chart, categorical-data

R ggplot compare similar (but not identical) columns of categorical data


I've got a dataset which looks like this:

platform   twitter_context   facebook_context  insta_context
Twitter    Hashtag           NA                NA
Facebook   NA                Facebook Group    NA
Instagram  NA                NA                Public Figure
Instagram  NA                NA                Hashtag
Facebook   NA                A friend          NA
Twitter    Someone I follow  NA                NA

… (more than 1600 rows in total)

What I would like to achieve is a bar chart that compares the frequency of the categories in those "_context" columns by "platform".

I have used ggplot before to draw a bar chart that combines two variables. But here, the categories in those "_context" columns are similar, but not identical.

As each context column only applies to one platform, I tried to merge the three context columns into a new column using the mutate function. However, I failed to make it work properly: when I ran three mutate lines consecutively, the NAs would always overwrite the previous categories. I tried to solve this with if/else if conditions, so that only proper data would be pasted into the new column (ignoring the NAs). But this idea was doomed by my lack of syntactical understanding.

I suppose there must be a way to get this right, however, I couldn't do it. (Did I mention I am quite new to this?)

My intention was that I could then plot a chart using the new "all_contexts" column and split it up on the x axis by platform. (The labelling would still be a mess, but possibly that could be fixed by applying levels.)

A different approach I could imagine would be to have ggplot draw three independent bar charts, which would then have to be standardized manually, unless there is a way to "concatenate" them somehow in a single plot.

Very likely this rookie problem has already been covered in a thread which I was unable to find. Can someone point me in the right direction? I appreciate your help!


Solution

  • There are a number of ways to transform your data to prepare it for the plot that you want to create. One way is illustrated here, where we use pivot_longer(), remove the rows that are NA, and then count the number of rows by platform and context:

    library(dplyr)
    library(tidyr)
    
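    # Stack the three *_context columns into one, drop the NA filler rows,
    # then count the rows per platform/context pair
    ggdata <- df %>%
      pivot_longer(cols = ends_with("context"), names_to = "p", values_to = "context") %>%
      filter(!is.na(context)) %>%
      count(platform, context)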
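
    As a different route to the same counts, the merge that the question attempted with mutate() can be sketched with dplyr's coalesce(), which returns the first non-missing value across its arguments, so the NAs never overwrite a real category. The small df below is a hypothetical sample built from the six rows shown in the question, and ggdata_alt is an illustrative name that is not part of the original answer.

    ```r
    library(dplyr)

    # Hypothetical sample: the six rows shown in the question
    df <- data.frame(
      platform         = c("Twitter", "Facebook", "Instagram", "Instagram", "Facebook", "Twitter"),
      twitter_context  = c("Hashtag", NA, NA, NA, NA, "Someone I follow"),
      facebook_context = c(NA, "Facebook Group", NA, NA, "A friend", NA),
      insta_context    = c(NA, NA, "Public Figure", "Hashtag", NA, NA),
      stringsAsFactors = FALSE
    )

    ggdata_alt <- df %>%
      # Take the first non-NA value among the three context columns
      mutate(context = coalesce(twitter_context, facebook_context, insta_context)) %>%
      # Drop rows where all three context columns were NA
      filter(!is.na(context)) %>%
      count(platform, context)
    ```

    This only works as intended because each row carries at most one non-NA context; if a row had several, coalesce() would silently keep the first.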
    

    Now, you can directly pass the frame as is to ggplot() using geom_col(), or you could add rows for the platform/context combinations that are not represented.

    Here is an example of the former approach:

    library(ggplot2)
    ggplot(ggdata, aes(platform, n, fill = context)) +
      geom_col(position = "dodge")
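
    The latter approach, adding rows for the platform/context combinations that are not represented, could be sketched with tidyr's complete(). This is a minimal sketch under assumptions: df is a hypothetical sample built from the six rows shown in the question, ggdata_full is an illustrative name, and filling the missing counts with zero is an assumption rather than something stated above.

    ```r
    library(dplyr)
    library(tidyr)
    library(ggplot2)

    # Hypothetical sample: the six rows shown in the question
    df <- data.frame(
      platform         = c("Twitter", "Facebook", "Instagram", "Instagram", "Facebook", "Twitter"),
      twitter_context  = c("Hashtag", NA, NA, NA, NA, "Someone I follow"),
      facebook_context = c(NA, "Facebook Group", NA, NA, "A friend", NA),
      insta_context    = c(NA, NA, "Public Figure", "Hashtag", NA, NA),
      stringsAsFactors = FALSE
    )

    # Same transformation as above: long format, no NAs, counts per pair
    ggdata <- df %>%
      pivot_longer(cols = ends_with("context"), names_to = "p", values_to = "context") %>%
      filter(!is.na(context)) %>%
      count(platform, context)

    # Add a row with n = 0 for every platform/context pair that never occurs
    ggdata_full <- ggdata %>%
      complete(platform, context, fill = list(n = 0L))

    ggplot(ggdata_full, aes(platform, n, fill = context)) +
      geom_col(position = "dodge")
    ```

    With the zero-count rows present, each platform should reserve a slot for every context, so the dodged bars keep a uniform width across platforms.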