I have a data which looks like this:
| Employee | Employee_id | Transaction_date | Expense_Type | Attendees | Vendor | Purpose | Amount |
|----------|:-----------:|-----------------:|-----------------|-----------|--------------|-----------------------------|--------|
| Nancy | 1 | 12/27/2018 | Individual_Meal | NA | Chiles | Dinner in NYC | 128 |
| David | 2 | 9/9/2017 | Group_Meal | Jess | Renaissance | External Business Meeting | 600 |
| David | 2 | 9/9/2017 | Group_Meal | Peter | Renaissance | External Business Meeting | 600 |
| David | 2 | 9/9/2017 | Group_Meal | David | Renaissance | External Business Meeting | 600 |
| John | 3 | 10/4/2017 | Group_Meal | Mike | Subway | Lunch with Mike and Maximus | 130 |
| Mary | 4 | 1/16/2019 | Group_Meal | Carol | Olive_Garden | summit with Intel | 235 |
| Mary | 4 | 1/16/2019 | Group_Meal | Sonia | Olive_Garden | summit with Intel | 235 |
| Mary | 4 | 1/16/2019 | Group_Meal | James | Olive_Garden | summit with Intel | 235 |
| Mary | 4 | 1/16/2019 | Group_Meal | Mary | Olive_Garden | summit with Intel | 235 |
| John | 3 | 10/4/2017 | Group_Meal | Maximus | Subway | Lunch with Mike and Maximus | 130 |
| John | 3 | 10/4/2017 | Group_Meal | John | Subway | Lunch with Mike and Maximus | 130 |
| Richard | 5 | 4/11/2018 | Individual_Meal | NA | Dominos | Dinner in Ohio | 50 |
I want to aggregate the table in such a way that I can see the no of attendees for each employee and the total expense incurred for them. The final table should look something like this:
| Employee | Employee_id | Transaction_date | Expense_Type | Vendor | Purpose | No_of_Attendee | Total_Amount |
|----------|:-----------:|-----------------:|-----------------|--------------|-----------------------------|----------------|--------------|
| Nancy | 1 | 12/27/2018 | Individual_Meal | Chiles | Dinner in NYC | 1 | 128 |
| David | 2 | 9/9/2017 | Group_Meal | Renaissance | External Business Meeting | 3 | 1800 |
| John | 3 | 10/4/2017 | Group_Meal | Subway | Lunch with Mike and Maximus | 3 | 390 |
| Mary | 4 | 1/16/2019 | Group_Meal | Olive_Garden | summit with Intel | 4 | 940 |
| Richard | 5 | 4/11/2018 | Individual_Meal | Dominos | Dinner in Ohio | 1 | 50 |
Next, I want to generate an area plot where I have 'transaction date' on x axis and 'Amount' on y axis with different variables such as vendor, purpose mentioned in the tooltip. I I have tried some code but I'm not sure how to calculate frequency and sum of two different columns while retaining other columns as shown in the desired output table. Also, when I try to use text within ggplot2, the area graph comes fine until only employee is mentioned. As soon as I include vendor and/or purpose, the area graph changes. I'm not sure why is this happening. Can someone please have a look at my code and let me know what is wrong and how to rectify it?
library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
df4=read_excel("C:/Users/xyz/Desktop/eg1.xlsx")
df4_freq=df4 %>% group_by(Employee,Employee_id,Transaction_date,Vendor,Purpose,Expense_Type,
Amount) %>% summarise(count=n())
colnames(df4_freq)[8]= "No_of_Attendee"
plot=ggplot(d4_freq, aes(x = Transaction_date, y = Amount,
text=paste('Employee:',Employee,
'<br>No of Attendees:', No_of_Attendee,
'<br>Amount Per Attendee:', Amount,
'<br>Purpose:', Purpose,
'<br>Vendor:', Vendor
))) +
geom_area(aes(color = Expense_Type, fill = Expense_Type),
alpha = 0.5, position = position_dodge(0.8))+
geom_point(colour="black")+
scale_color_manual(values = c("#CC6600", "#606060")) +
scale_fill_manual(values = c("#CC6600", "#606060"))
plot=ggplotly(p, tooltip = c("x","y","text"))
plot
PART 2: The other problem that I'm facing is with area graph. If I enter only "employee" as the variable in the "text", my plot is perfect. But when I enter other variables such as "No_of_Attendee","Vendor" etc, my plot changes to straight lines. Is there any issue with ggplotly or text? For reference, I'm posting the code again, since I have added some more data to it.
library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
df4=data.frame("Employee"=c("Nancy","David","David","David","John","Mary","Mary","Mary","Mary",
"John","John","Richard","David","David","Mary","Mary","Mary"),
"Employee_id"=c(1,2,2,2,3,4,4,4,4,3,3,5,2,2,4,4,4),
"Transaction_date"=c("12/27/2018","9/9/2017","9/9/2017","9/9/2017","10/4/2017","1/16/2019",
"1/16/2019","1/16/2019","1/16/2019","10/4/2017","10/4/2017","4/11/2018","1/1/2018","1/1/2018",
"4/5/2018","4/5/2018","4/5/2018"),
"Expense_Type"=c("Individual_Meal","Group_Meal","Group_Meal","Group_Meal","Group_Meal",
"Group_Meal","Group_Meal","Group_Meal","Group_Meal","Group_Meal", "Group_Meal",
"Individual_Meal","Group_Meal","Group_Meal","Group_Meal" ,"Group_Meal","Group_Meal"),
"Attendees"=c("NA","Jess","Peter","David","Mike","Carol","Sonia","James","Mary","Maximus",
"John","NA","Arya","David","Jon","Elizabeth","Marco"),
"Vendor"=c("Chiles","Renaissance","Renaissance","Renaissance","Subway","Olive_Garden","Olive_Garden",
"Olive_Garden","Olive_Garden","Subway","Subway","Dominos","BJ","BJ","Little_Italy","Little_Italy","Little_Italy"),
"Purpose"=c("Dinner in NYC","External Business Meeting","External Business Meeting","External Business Meeting",
"Lunch with Mike and Maximus","summit with Intel","summit with Intel","summit with Intel","summit with Intel",
"Lunch with Mike and Maximus","Lunch with Mike and Maximus","Dinner in Ohio","Lunch with Arya","Lunch with Arya",
"Business_Meeting","Business_Meeting","Business_Meeting"),
"Amount"= c(128,600,600,600,130,235,235,235,235,130,130,50,95,95,310,310,310))
str(df4)
df4$Transaction_date<- as.Date(df4$Transaction_date, "%m/%d/%Y")
df4_freq=df4 %>% group_by(Employee,Employee_id,Transaction_date,Vendor,Purpose,Expense_Type)%>% summarise(No_of_Attendee=n(), Total_Amount=sum(Amount))
plot=ggplot(df4_freq, aes(x = Transaction_date, y = Total_Amount,
text=paste('Employee:',Employee))) +
geom_area(aes(color = Expense_Type, fill = Expense_Type),
alpha = 0.5, position = position_dodge(0.8))+
geom_point(colour="black")+
scale_color_manual(values = c("#CC6600", "#606060")) +
scale_fill_manual(values = c("#CC6600", "#606060"))
plot=ggplotly(plot, tooltip = c("x","y","text"))
plot
Below is the plot which looks like perfect with only 'Employee' variable in the text.
However, when I include other variables such as 'No_of_Attendee', 'Vendor' etc, my plot comes as single line. Below is the code and plot.
plot=ggplot(df4_freq, aes(x = Transaction_date, y = Total_Amount,
text=paste('Employee:',Employee,
'<br>No of Attendees:', No_of_Attendee,
'<br>Total_Amount:', Total_Amount,
'<br>Purpose:', Purpose,
'<br>Vendor:', Vendor
))) +
geom_area(aes(color = Expense_Type, fill = Expense_Type),
alpha = 0.5, position = position_dodge(0.8))+
geom_point(colour="black")+
scale_color_manual(values = c("#CC6600", "#606060")) +
scale_fill_manual(values = c("#CC6600", "#606060"))
plot=ggplotly(plot, tooltip = c("x","y","text"))
plot
It would be really great and helpful if someone could help me what is wrong with my code.
It seems like by grouping by Amount you are preventing Total_Amount from being calculated. For example David's Meal on 9/9/2017 will create a group that represents those three rows, but then you can only summarize with count = n() which will count the number of rows in that group. But because you grouped on Amount you won't be able to produce a row that summarizes the Total_Amount. I would suggest the following to create the dataset you're looking for:
data %>%
group_by(Employee, Employee_id, Transaction_date, Expense_Type, Vendor, Purpose) %>%
summarize(No_of_Attendee = n(),
Total_Amount = sum(Amount))