Search code examples
rggplot2frequencyggplotlygeom-area

Calculate sum and frequency of two different columns with multiple variables and plot using area graph


I have a data which looks like this:

| Employee | Employee_id | Transaction_date | Expense_Type    | Attendees | Vendor       | Purpose                     | Amount |
|----------|:-----------:|-----------------:|-----------------|-----------|--------------|-----------------------------|--------|
| Nancy    |      1      |       12/27/2018 | Individual_Meal | NA        | Chiles       | Dinner in NYC               | 128    |
| David    |      2      |         9/9/2017 | Group_Meal      | Jess      | Renaissance  | External Business Meeting   | 600    |
| David    |      2      |         9/9/2017 | Group_Meal      | Peter     | Renaissance  | External Business Meeting   | 600    |
| David    | 2           | 9/9/2017         | Group_Meal      | David     | Renaissance  | External Business Meeting   | 600    |
| John     | 3           | 10/4/2017        | Group_Meal      | Mike      | Subway       | Lunch with Mike and Maximus | 130    |
| Mary     | 4           | 1/16/2019        | Group_Meal      | Carol     | Olive_Garden | summit with Intel           | 235    |
| Mary     | 4           | 1/16/2019        | Group_Meal      | Sonia     | Olive_Garden | summit with Intel           | 235    |
| Mary     | 4           | 1/16/2019        | Group_Meal      | James     | Olive_Garden | summit with Intel           | 235    |
| Mary     | 4           | 1/16/2019        | Group_Meal      | Mary      | Olive_Garden | summit with Intel           | 235    |
| John     | 3           | 10/4/2017        | Group_Meal      | Maximus   | Subway       | Lunch with Mike and Maximus | 130    |
| John     | 3           | 10/4/2017        | Group_Meal      | John      | Subway       | Lunch with Mike and Maximus | 130    |
| Richard  | 5           | 4/11/2018        | Individual_Meal | NA        | Dominos      | Dinner in Ohio              | 50     |

I want to aggregate the table in such a way that I can see the no of attendees for each employee and the total expense incurred for them. The final table should look something like this:

| Employee | Employee_id | Transaction_date | Expense_Type    | Vendor       | Purpose                     | No_of_Attendee | Total_Amount |
|----------|:-----------:|-----------------:|-----------------|--------------|-----------------------------|----------------|--------------|
| Nancy    |      1      |       12/27/2018 | Individual_Meal | Chiles       | Dinner in NYC               | 1              | 128          |
| David    |      2      |         9/9/2017 | Group_Meal      | Renaissance  | External Business Meeting   | 3              | 1800         |
| John     |      3      |        10/4/2017 | Group_Meal      | Subway       | Lunch with Mike and Maximus | 3              | 390          |
| Mary     | 4           | 1/16/2019        | Group_Meal      | Olive_Garden | summit with Intel           | 4              | 940          |
| Richard  | 5           | 4/11/2018        | Individual_Meal | Dominos      | Dinner in Ohio              | 1              | 50           |

Next, I want to generate an area plot where I have 'transaction date' on x axis and 'Amount' on y axis with different variables such as vendor, purpose mentioned in the tooltip. I I have tried some code but I'm not sure how to calculate frequency and sum of two different columns while retaining other columns as shown in the desired output table. Also, when I try to use text within ggplot2, the area graph comes fine until only employee is mentioned. As soon as I include vendor and/or purpose, the area graph changes. I'm not sure why is this happening. Can someone please have a look at my code and let me know what is wrong and how to rectify it?

library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
df4=read_excel("C:/Users/xyz/Desktop/eg1.xlsx")
df4_freq=df4 %>% group_by(Employee,Employee_id,Transaction_date,Vendor,Purpose,Expense_Type,
                              Amount) %>% summarise(count=n())
colnames(df4_freq)[8]= "No_of_Attendee"
plot=ggplot(d4_freq, aes(x = Transaction_date, y = Amount, 
                         text=paste('Employee:',Employee,
                                    '<br>No of Attendees:', No_of_Attendee,
                                    '<br>Amount Per Attendee:', Amount,
                                    '<br>Purpose:', Purpose,
                                    '<br>Vendor:', Vendor
                                    ))) + 
  geom_area(aes(color = Expense_Type, fill = Expense_Type), 
            alpha = 0.5, position = position_dodge(0.8))+
  geom_point(colour="black")+
  scale_color_manual(values = c("#CC6600", "#606060")) +
  scale_fill_manual(values = c("#CC6600", "#606060"))
plot=ggplotly(p, tooltip = c("x","y","text"))
plot

PART 2: The other problem that I'm facing is with area graph. If I enter only "employee" as the variable in the "text", my plot is perfect. But when I enter other variables such as "No_of_Attendee","Vendor" etc, my plot changes to straight lines. Is there any issue with ggplotly or text? For reference, I'm posting the code again, since I have added some more data to it.

library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
df4=data.frame("Employee"=c("Nancy","David","David","David","John","Mary","Mary","Mary","Mary",
"John","John","Richard","David","David","Mary","Mary","Mary"), 
"Employee_id"=c(1,2,2,2,3,4,4,4,4,3,3,5,2,2,4,4,4),
"Transaction_date"=c("12/27/2018","9/9/2017","9/9/2017","9/9/2017","10/4/2017","1/16/2019",
"1/16/2019","1/16/2019","1/16/2019","10/4/2017","10/4/2017","4/11/2018","1/1/2018","1/1/2018",
"4/5/2018","4/5/2018","4/5/2018"), 
"Expense_Type"=c("Individual_Meal","Group_Meal","Group_Meal","Group_Meal","Group_Meal",
"Group_Meal","Group_Meal","Group_Meal","Group_Meal","Group_Meal", "Group_Meal",
"Individual_Meal","Group_Meal","Group_Meal","Group_Meal" ,"Group_Meal","Group_Meal"),
"Attendees"=c("NA","Jess","Peter","David","Mike","Carol","Sonia","James","Mary","Maximus",
"John","NA","Arya","David","Jon","Elizabeth","Marco"),
"Vendor"=c("Chiles","Renaissance","Renaissance","Renaissance","Subway","Olive_Garden","Olive_Garden",
"Olive_Garden","Olive_Garden","Subway","Subway","Dominos","BJ","BJ","Little_Italy","Little_Italy","Little_Italy"),
"Purpose"=c("Dinner in NYC","External Business Meeting","External Business Meeting","External Business Meeting",
"Lunch with Mike and Maximus","summit with Intel","summit with Intel","summit with Intel","summit with Intel",
"Lunch with Mike and Maximus","Lunch with Mike and Maximus","Dinner in Ohio","Lunch with Arya","Lunch with Arya",
"Business_Meeting","Business_Meeting","Business_Meeting"),
"Amount"= c(128,600,600,600,130,235,235,235,235,130,130,50,95,95,310,310,310))
str(df4)
df4$Transaction_date<- as.Date(df4$Transaction_date, "%m/%d/%Y")
df4_freq=df4 %>% group_by(Employee,Employee_id,Transaction_date,Vendor,Purpose,Expense_Type)%>% summarise(No_of_Attendee=n(), Total_Amount=sum(Amount))

plot=ggplot(df4_freq, aes(x = Transaction_date, y = Total_Amount, 
                         text=paste('Employee:',Employee))) + 
  geom_area(aes(color = Expense_Type, fill = Expense_Type), 
            alpha = 0.5, position = position_dodge(0.8))+
  geom_point(colour="black")+
  scale_color_manual(values = c("#CC6600", "#606060")) +
  scale_fill_manual(values = c("#CC6600", "#606060"))
plot=ggplotly(plot, tooltip = c("x","y","text"))
plot 

Below is the plot which looks like perfect with only 'Employee' variable in the text. enter image description here

However, when I include other variables such as 'No_of_Attendee', 'Vendor' etc, my plot comes as single line. Below is the code and plot.

plot=ggplot(df4_freq, aes(x = Transaction_date, y = Total_Amount, 
                         text=paste('Employee:',Employee,
                                    '<br>No of Attendees:', No_of_Attendee,
                                     '<br>Total_Amount:', Total_Amount,
                                     '<br>Purpose:', Purpose,
                                     '<br>Vendor:', Vendor
                                    ))) + 
  geom_area(aes(color = Expense_Type, fill = Expense_Type), 
            alpha = 0.5, position = position_dodge(0.8))+
  geom_point(colour="black")+
  scale_color_manual(values = c("#CC6600", "#606060")) +
  scale_fill_manual(values = c("#CC6600", "#606060"))
plot=ggplotly(plot, tooltip = c("x","y","text"))
plot 

enter image description here

It would be really great and helpful if someone could help me what is wrong with my code.


Solution

  • It seems like by grouping by Amount you are preventing Total_Amount from being calculated. For example David's Meal on 9/9/2017 will create a group that represents those three rows, but then you can only summarize with count = n() which will count the number of rows in that group. But because you grouped on Amount you won't be able to produce a row that summarizes the Total_Amount. I would suggest the following to create the dataset you're looking for:

    data %>%
      group_by(Employee, Employee_id, Transaction_date, Expense_Type, Vendor, Purpose) %>%
      summarize(No_of_Attendee = n(),
                Total_Amount = sum(Amount))