Tags: r, dataframe, ggplot2, histogram, fill

How to fill histogram by different columns?


I got stuck in my project and would be very grateful for the help. My goal is to explore the relationship between type (A, B, or C) and total income. I want to plot the income in a histogram and fill in the color by type.
My original data looked like this:

ID year income type
x1 2015 300 A
x1 2015 700 C
x1 2016 1000 A
x1 2016 90 B
x1 2016 100 B
x2 2015 2000 A
x2 2015 150 B
x2 2015 500 C
x2 2015 45 C
x2 2016 100 B
x3 2015 111 C

In this case, plotting income on the x-axis with aes(fill = type) fills the colors properly, as the histogram below shows.

[Figure: histogram of income, filled by type]

library(ggplot2)
# Histogram of per-row income, with bars filled by type
h <- ggplot(data, aes(fill = type, x = income))
h + geom_histogram()
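For anyone who wants to reproduce this, the table above could be entered as a data frame roughly as follows. The column types are an assumption (year as character, income as integer); the question does not state them.

```r
# Sample data from the question; column types are assumed, not given
data <- data.frame(
  ID     = c("x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x3"),
  year   = c("2015", "2015", "2016", "2016", "2016",
             "2015", "2015", "2015", "2015", "2016", "2015"),
  income = c(300L, 700L, 1000L, 90L, 100L, 2000L, 150L, 500L, 45L, 100L, 111L),
  type   = c("A", "C", "A", "B", "B", "A", "B", "C", "C", "B", "C"),
  stringsAsFactors = FALSE
)

# Quick check of the structure
str(data)
```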

However, with the first table the actual personal income for each year is lost, because the histogram treats each row as a separate individual. For example, the 2015 income of individual x1 is counted in the 300 and 700 bins, even though their total income that year was 1000. After summing the income and counting the types per individual and year, I get the following table:

ID year income_sum typeA typeB typeC
x1 2015 1000 1 0 1
x1 2016 1190 1 2 0
x2 2015 2695 1 1 2
x2 2016 100 0 1 0
x3 2015 111 0 0 1

h <- ggplot(data2, aes(x=income_sum))
h+geom_histogram()
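The summary table can be derived from the original rows in several ways. The following is one base-R sketch, not necessarily how `data2` was actually built. It assumes the `data` frame from the question, with guessed column types, and `wide` is a helper name introduced only for this illustration.

```r
# Sample data from the question; column types are assumed
data <- data.frame(
  ID     = c("x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x3"),
  year   = c("2015", "2015", "2016", "2016", "2016",
             "2015", "2015", "2015", "2015", "2016", "2015"),
  income = c(300L, 700L, 1000L, 90L, 100L, 2000L, 150L, 500L, 45L, 100L, 111L),
  type   = c("A", "C", "A", "B", "B", "A", "B", "C", "C", "B", "C"),
  stringsAsFactors = FALSE
)

# One 0/1 indicator column per type, so that summing counts occurrences
wide <- data.frame(
  income_sum = data$income,
  typeA = as.integer(data$type == "A"),
  typeB = as.integer(data$type == "B"),
  typeC = as.integer(data$type == "C")
)

# Sum every column within each ID/year combination
data2 <- aggregate(wide, by = data[c("ID", "year")], FUN = sum)

# Sort to match the row order of the table in the question
data2 <- data2[order(data2$ID, data2$year), ]
rownames(data2) <- NULL
data2
```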

This time the histogram represents total income accurately, but it can no longer be filled in three colors by type (A, B, C), because type is now spread across three count columns. See the histogram below.

[Figure: histogram of income_sum, without fill]

Does anyone have any ideas on how to solve this problem?


Solution

  • Do you want something like the following one?

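    library(dplyr)
    library(ggplot2)
    # Sum income per ID and year, then repeat that total once per distinct type.
    # dplyr >= 1.1.0 warns about multi-row summarize() results and offers reframe() instead.
    data %>% group_by(ID, year) %>% summarize(income = sum(income), type = unique(type)) %>%
      ggplot(aes(fill = type, x = income)) + geom_histogram()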
    

    [Figure: histogram of summed income, filled by type]

    Note that after the group_by and summarize steps you have the following tibble:

       ID    year  income type 
      <chr> <chr>  <int> <chr>
    1 x1    2015    1000 A    
    2 x1    2015    1000 C    
    3 x1    2016    1190 A    
    4 x1    2016    1190 B    
    5 x2    2015    2695 A    
    6 x2    2015    2695 B    
    7 x2    2015    2695 C    
    8 x2    2016     100 B    
    9 x3    2015     111 C  
    

    [EDIT]

    If you want the bar heights to be proportional to the number of times each type appears, the following should work:

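    # Per ID, year and type: sum the income and count the rows.
    # Then, per ID and year: replace income with the yearly total, keeping one row per type.
    df <- data %>% group_by(ID, year, type) %>%
                   summarise(income = sum(income), count = n()) %>%
                   group_by(ID, year) %>%
                   summarize(income = sum(income), type = type, count = count)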
    df
    
    # A tibble: 9 x 5
    # Groups:   ID, year [5]
      ID    year  income type  count
      <chr> <chr>  <int> <chr> <int>
    1 x1    2015    1000 A         1
    2 x1    2015    1000 C         1
    3 x1    2016    1190 A         1
    4 x1    2016    1190 B         2
    5 x2    2015    2695 A         1
    6 x2    2015    2695 B         1
    7 x2    2015    2695 C         2
    8 x2    2016     100 B         1
    9 x3    2015     111 C         1
    
    df %>% ggplot(aes(fill=type, color=type, x=income, y=count)) + 
      geom_bar(stat='identity', width = 50, alpha=0.5)
    

    [Figure: bar chart of count by income, colored by type]

    Note one difference: because the values 100 and 111 are not exactly equal (unlike the other incomes, which repeat exactly), the B and C bars at these values are not stacked on top of one another. They overlap instead, one centered at 100 and the other at 111.

    [EDIT2]

    To get what you want, the incomes additionally need to be binned (the bin width is currently set to 50; change it if needed):

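    # Breaks every 50 income units, starting at the smallest income
    bins <- seq(min(df$income), max(df$income), 50)
    # Bin index of each row: position of the last break <= income
    df$bin <- sapply(df$income, function(x) max(which(bins <= x)))

    # Give every row of a bin the same x position (the bin's mean income) so the bars stack
    df <- df %>% group_by(bin) %>%
      mutate(income = mean(income))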
    
    df
    
        ID    year  income type  count   bin
        <chr> <chr>  <dbl> <chr> <int> <int>
      1 x1    2015   1000  A         1    19
      2 x1    2015   1000  C         1    19
      3 x1    2016   1190  A         1    22
      4 x1    2016   1190  B         2    22
      5 x2    2015   2695  A         1    52
      6 x2    2015   2695  B         1    52
      7 x2    2015   2695  C         2    52
      8 x2    2016    106. B         1     1
      9 x3    2015    106. C         1     1
    
    df %>% 
      ggplot(aes(fill=type, color=type, x=income, y=count)) + 
      geom_bar(stat='identity', width = 50)
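    The sapply() call above assigns each income the index of the last break that is less than or equal to it. Base R's findInterval() performs the same lookup in one vectorised call, so a sketch like the following should give the same bin indices. The `income` vector is copied from the table in the first edit, and `bin_sapply` and `bin_fi` are names used only for this comparison.

    ```r
    # Incomes per row of df, copied from the answer's table
    income <- c(1000, 1000, 1190, 1190, 2695, 2695, 2695, 100, 111)

    # Breaks every 50 units, starting at the smallest income
    bins <- seq(min(income), max(income), 50)

    # Approach used in the answer: index of the last break <= x
    bin_sapply <- sapply(income, function(x) max(which(bins <= x)))

    # Vectorised equivalent
    bin_fi <- findInterval(income, bins)

    # Both approaches agree on the sample data
    stopifnot(all(bin_sapply == bin_fi))
    bin_fi
    # [1] 19 19 22 22 52 52 52  1  1
    ```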
    

    [Figure: bar chart of count by binned income, colored by type]

    Note that the income for a bin is set to the average of the data points inside the bin, so each bar is drawn at that average rather than at the bin's midpoint.
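
    As an alternative to binning by hand, geom_histogram() understands a weight aesthetic, so ggplot2 could do the binning and stacking itself. This is a different technique from the one above, offered only as a sketch. Here `df` is rebuilt from the table printed after the first edit, and binwidth = 50 is carried over from the answer.

    ```r
    library(ggplot2)

    # df as printed after the first edit (one row per ID, year and type)
    df <- data.frame(
      ID     = c("x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x3"),
      year   = c("2015", "2015", "2016", "2016", "2015", "2015", "2015", "2016", "2015"),
      income = c(1000, 1000, 1190, 1190, 2695, 2695, 2695, 100, 111),
      type   = c("A", "C", "A", "B", "A", "B", "C", "B", "C"),
      count  = c(1, 1, 1, 2, 1, 1, 2, 1, 1)
    )

    # Each row contributes `count` to its bin instead of 1; bars stack by type
    p <- ggplot(df, aes(x = income, fill = type, weight = count)) +
      geom_histogram(binwidth = 50)
    p
    ```

    Whether 100 and 111 end up in the same bar depends on where the bin edges fall; the boundary or center arguments of geom_histogram() can be used to control that.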