I'm stuck on my project and would be very grateful for any help.
My goal is to explore the relationship between type (A, B, or C) and total income. I want to plot the income in a histogram and fill in the color by type.
My original data looked like this:
ID | year | income | type |
---|---|---|---|
x1 | 2015 | 300 | A |
x1 | 2015 | 700 | C |
x1 | 2016 | 1000 | A |
x1 | 2016 | 90 | B |
x1 | 2016 | 100 | B |
x2 | 2015 | 2000 | A |
x2 | 2015 | 150 | B |
x2 | 2015 | 500 | C |
x2 | 2015 | 45 | C |
x2 | 2016 | 100 | B |
x3 | 2015 | 111 | C |
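For reproducibility, here is a minimal sketch of the table above as an R data frame. The object name data matches the plotting code; the column types (numeric year and income, character ID and type) are an assumption, since the original import step is not shown.

```r
# The example table, entered by hand.
# Column types are assumed: numeric year/income, character ID/type.
data <- data.frame(
  ID     = c("x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x3"),
  year   = c(2015, 2015, 2016, 2016, 2016, 2015, 2015, 2015, 2015, 2016, 2015),
  income = c(300, 700, 1000, 90, 100, 2000, 150, 500, 45, 100, 111),
  type   = c("A", "C", "A", "B", "B", "A", "B", "C", "C", "B", "C"),
  stringsAsFactors = FALSE
)

# Quick look at the structure: 11 rows, 4 columns
str(data)
```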
With this table, plotting income on the x-axis and using aes(fill = type) fills the colors properly:
library(ggplot2)
h <- ggplot(data, aes(x = income, fill = type))
h + geom_histogram()
However, with the first table the actual personal income for each year is lost, because the histogram treats each row as a different individual. For example, the 2015 income of individual x1 is split between the 300 and 700 bins, even though their total income that year is 1000. So after summing the income per ID and year and counting how often each type occurs, I get the following table:
ID | year | income_sum | typeA | typeB | typeC |
---|---|---|---|---|---|
x1 | 2015 | 1000 | 1 | 0 | 1 |
x1 | 2016 | 1190 | 1 | 2 | 0 |
x2 | 2015 | 2695 | 1 | 1 | 2 |
x2 | 2016 | 100 | 0 | 1 | 0 |
x3 | 2015 | 111 | 0 | 0 | 1 |
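A minimal sketch of how such a summary table could be computed with dplyr. The code actually used for this step is not shown, so this is only one assumed way to do it; the data frame is re-entered here so the snippet runs on its own.

```r
library(dplyr)

# Example data from the first table (column types are assumed)
data <- data.frame(
  ID     = c("x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x3"),
  year   = c(2015, 2015, 2016, 2016, 2016, 2015, 2015, 2015, 2015, 2016, 2015),
  income = c(300, 700, 1000, 90, 100, 2000, 150, 500, 45, 100, 111),
  type   = c("A", "C", "A", "B", "B", "A", "B", "C", "C", "B", "C"),
  stringsAsFactors = FALSE
)

# One row per ID and year: total income plus the number of rows of each type
data2 <- data %>%
  group_by(ID, year) %>%
  summarise(
    income_sum = sum(income),
    typeA = sum(type == "A"),
    typeB = sum(type == "B"),
    typeC = sum(type == "C"),
    .groups = "drop"
  )

data2
```

Because the type information is now spread over three count columns, there is no single type column left to map to fill, which is why the histogram of income_sum cannot be colored by type directly.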
h <- ggplot(data2, aes(x=income_sum))
h+geom_histogram()
This time the histogram represents total income accurately, but it can no longer be filled with three different colors by type (A, B, C).
Does anyone have any ideas on how to solve this problem?
Do you want something like the following?
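library(dplyr)
library(ggplot2)
# Total income per ID and year, keeping one row for each type that occurs
# (dplyr >= 1.1.0 warns about a multi-row summarize(); reframe() replaces it there)
data %>% group_by(ID, year) %>% summarize(income = sum(income), type = unique(type)) %>%
  ggplot(aes(fill = type, x = income)) + geom_histogram()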
Note that after the group_by and summarize steps you have the following tibble:
ID year income type
<chr> <chr> <int> <chr>
1 x1 2015 1000 A
2 x1 2015 1000 C
3 x1 2016 1190 A
4 x1 2016 1190 B
5 x2 2015 2695 A
6 x2 2015 2695 B
7 x2 2015 2695 C
8 x2 2016 100 B
9 x3 2015 111 C
[EDIT]
If you want the bar heights to be proportional to the number of times each type appears, the following should work:
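# Income and number of rows per ID, year and type
df <- data %>% group_by(ID, year, type) %>%
  summarise(income = sum(income), count = n()) %>%
  # Replace income with the ID/year total, keeping one row per type
  group_by(ID, year) %>%
  summarize(income = sum(income), type = type, count = count)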
df
# A tibble: 9 x 5
# Groups: ID, year [5]
ID year income type count
<chr> <chr> <int> <chr> <int>
1 x1 2015 1000 A 1
2 x1 2015 1000 C 1
3 x1 2016 1190 A 1
4 x1 2016 1190 B 2
5 x2 2015 2695 A 1
6 x2 2015 2695 B 1
7 x2 2015 2695 C 2
8 x2 2016 100 B 1
9 x3 2015 111 C 1
df %>% ggplot(aes(fill=type, color=type, x=income, y=count)) +
geom_bar(stat='identity', width = 50, alpha=0.5)
Note that there is one difference. Since the values 100 and 111 are not exactly the same (unlike the others), the bars for B and C at these values are not stacked on top of one another; instead they overlap (one is centered at 100 and the other at 111).
[EDIT2]
To achieve what you want, we additionally need to bin the incomes (change the bin width if needed; it is currently set to 50):
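# Bin edges every 50 income units, starting at the smallest income
bins <- seq(min(df$income), max(df$income), 50)
# Index of the last bin edge that is <= each income
df$bin <- sapply(df$income, function(x) max(which(bins <= x)))
# Replace each income with the mean income of its bin so bars in one bin stack
df <- df %>% group_by(bin) %>%
  mutate(income = mean(income))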
df
ID year income type count bin
<chr> <chr> <dbl> <chr> <int> <int>
1 x1 2015 1000 A 1 19
2 x1 2015 1000 C 1 19
3 x1 2016 1190 A 1 22
4 x1 2016 1190 B 2 22
5 x2 2015 2695 A 1 52
6 x2 2015 2695 B 1 52
7 x2 2015 2695 C 2 52
8 x2 2016 106. B 1 1
9 x3 2015 106. C 1 1
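As a side note, the bin index computed with sapply above could probably also be obtained with base R's findInterval(), which returns, for each value, the number of break points less than or equal to it. A small sketch using the nine income values from the tibble (the variable names incomes, bin_sapply and bin_fi are only illustrative):

```r
# Income totals as they appear in the nine rows of df
incomes <- c(1000, 1000, 1190, 1190, 2695, 2695, 2695, 100, 111)

# Same bin edges as in the answer: every 50 units from the minimum income
bins <- seq(min(incomes), max(incomes), 50)

# Original approach: index of the last edge that is <= each income
bin_sapply <- sapply(incomes, function(x) max(which(bins <= x)))

# Vectorised alternative from base R
bin_fi <- findInterval(incomes, bins)

# Both approaches agree on these values
stopifnot(all(bin_sapply == bin_fi))
bin_fi
# 19 19 22 22 52 52 52 1 1, matching the bin column shown above
```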
df %>%
ggplot(aes(fill=type, color=type, x=income, y=count)) +
geom_bar(stat='identity', width = 50)
Note that the income for a bin is set to the average of the data points that fall inside that bin.
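A possible alternative, sketched here under the assumption that each original row should count once in the bin of its ID/year total: attach the total to every row with mutate() and let geom_histogram() do the binning. This is not the approach used above, only a shorter variant that should give a similar stacked picture; the names plot_data, income_sum and p are illustrative, and the bin boundaries chosen by geom_histogram() may differ from the manual bins.

```r
library(dplyr)
library(ggplot2)

# Example data from the question (column types are assumed)
data <- data.frame(
  ID     = c("x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x3"),
  year   = c(2015, 2015, 2016, 2016, 2016, 2015, 2015, 2015, 2015, 2016, 2015),
  income = c(300, 700, 1000, 90, 100, 2000, 150, 500, 45, 100, 111),
  type   = c("A", "C", "A", "B", "B", "A", "B", "C", "C", "B", "C"),
  stringsAsFactors = FALSE
)

# Attach the ID/year total to every original row (mutate keeps all 11 rows)
plot_data <- data %>%
  group_by(ID, year) %>%
  mutate(income_sum = sum(income)) %>%
  ungroup()

# Each row counts once in the bin of its ID/year total, stacked by type
p <- ggplot(plot_data, aes(x = income_sum, fill = type)) +
  geom_histogram(binwidth = 50)
print(p)
```

With this variant the bar height for a type is the number of original rows of that type, so B contributes 2 at the 1190 total, as in the count column above.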