Search code examples
rggplot2binning

how to bin numerical data when its mixed with character


Let's say I have 3 columns of data in R:

  1. Type: 'A' 'B' 'C' 'D' 'E' 'F' 'G'
  2. Value: UT 30 45 50 62 70 72
  3. Efficiency: 70 72 80 88 90 92 98

I want to bin just the 'numerical' data in the 'Value' column in incremement of 20 and display that on X-axis, while leaving the 'text' value in place [so the x-axis will display: UT 30-50 50-70 70-90], while showing 'Efficiency' on the Y-axis, with color = type.

Binning numerical datatype seems straight forward: bins <- seq(30, 80, by = 20) then plotting, but having that 'UT' is giving me a real challenge.

I'm a noob; just played around

#create bins
bins <- seq(30, 90, by = 20) 

#create plot 
ggplot(df, aes(x = cut(`Value`, breaks = bins, labels = sprintf("%d-%d", bins[-length(bins)], bins[-1])), y = 'Efficiency', color = `Type`)) 

Solution

  • df <- data.frame(
      Type = c("A", "B", "C", "D", "E", "F", "G"),
      Value = c("UT", "30", "45", "50", "62", "70", "72"),
      Efficiency = c(70, 72, 80, 88, 90, 92, 98)
    )
    

    Take a step back and consider the organization of your dataset. Columns in dataframes are vectors of a single type, so by including "UT" in that column, this is a character column:

    dplyr::glimpse(df)
    
    Rows: 7
    Columns: 3
    $ Type       <chr> "A", "B", "C", "D", "E", "F", "G"
    $ Value      <chr> "UT", "30", "45", "50", "62", "70", "72"
    $ Efficiency <dbl> 70, 72, 80, 88, 90, 92, 98
    

    Your Value column appears to originate from numerical data, but you have that pesky UT. In this case, I think it's better to get more detailed with the dataframe so you can more precisely describe the situation:

    # New column to give context to the value column
    df$Value_type <- ifelse(df$Value == "UT", "UT", NA)
    df$Value[df$Value == "UT"] <- NA
    df$Value <- as.numeric(df$Value)
    
    # Categories for the Value column
    df <- df |>
      dplyr::mutate(Value_cat = dplyr::case_when(
        30 <= Value & Value < 50 ~ "30-50",
        50 <= Value & Value < 70 ~ "50-70",
        70 <= Value & Value < 90 ~ "70-90",
        Value_type == "UT" ~ "UT",
        .default = NA
      ))
    
    # Set factor levels so that any plots have desired order 
    df$Value_cat <- factor(df$Value_cat, levels = c(
      "UT", "30-50", "50-70", "70-90"
    ))
    
    dplyr::glimpse(df)
    
    Rows: 7
    Columns: 5
    $ Type       <chr> "A", "B", "C", "D", "E", "F", "G"
    $ Value      <dbl> NA, 30, 45, 50, 62, 70, 72
    $ Efficiency <dbl> 70, 72, 80, 88, 90, 92, 98
    $ Value_type <chr> "UT", NA, NA, NA, NA, NA, NA
    $ Value_cat  <fct> UT, 30-50, 30-50, 50-70, 50-70, 70-90, 70-90
    

    Now we're a bit closer to the fabled "tidy" dataset: Where rows are each individual observations, and each observation has characteristics that are well-defined in separate variables / columns. You also get the convenience of accessing your Value measurements in a numerical vector. It's generally good to avoid casting numbers as strings when you need to write any sort of logic about their numerical properties.

    Since you have a category "UT", your x-axis is really categorical, rather than numeric. I suggest a barplot with "bins" created from these categories, rather than a histogram where you set bin width, as histograms are generally meant for continuous numerical data.

    library(ggplot2)
    # Since your `fill` is not 1:1 with your x-axis, you can set
    # the position of the bars in the position argument
    ggplot(data = df, mapping = aes(x = Value_cat, y = Efficiency)) +
      geom_col(aes(fill = Type), position = position_dodge())