I have been able to create the following visualization in Python but would like to re-create it in R. The data can be found further down in the R code.
The Python code I wrote to generate the graph below from the data is:
import matplotlib.pyplot as plt
import pandas as pd
# Load the data
portfolio_data = pd.read_excel("Data.xlsx")
# Define colors for each Therapeutic Area (TA)
ta_colors = {
    'Malaria': 'lightblue',
    'HIV': 'lightgreen',
    # Additional colors can be added for other TAs if present in the dataset
}
# Define the width of the bars to adjust the diamond symbol position
bar_width = 0.8
plt.figure(figsize=(12, 8))
# For each phase, plot the projects, label them, color them by TA, add symbol for external funding, and draw border for NME type
for idx, phase in enumerate(portfolio_data['Phase'].unique()):
    phase_data = portfolio_data[portfolio_data['Phase'] == phase]
    bottom_offset = 0
    for index, row in phase_data.iterrows():
        edge_color = 'black' if row['Type'] == 'NME' else None  # Add border if project type is NME
        plt.bar(idx, 1, bottom=bottom_offset, color=ta_colors[row['TA']], edgecolor=edge_color, linewidth=1.2)
        plt.text(idx, bottom_offset + 0.5, row['Project'], ha='center', va='center', fontsize=10)
        # Add diamond symbol next to projects with external funding, positioned on the right border of the bar
        if row['Funding'] == 'External':
            plt.text(idx + bar_width/2, bottom_offset + 0.5, u'\u25C6', ha='right', va='center', fontsize=10, color='red')
        bottom_offset += 1
# Adjust x-ticks to match phase names
plt.xticks(range(len(portfolio_data['Phase'].unique())), portfolio_data['Phase'].unique())
# Create legends for the TAs and external funding separately
legend_handles_ta = [plt.Rectangle((0, 0), 1, 1, color=ta_colors[ta], label = ta) for ta in ta_colors.keys() ]
legend_external_funding = [plt.Line2D([0], [0], marker='D', color='red', markersize=10, label='External Funding', linestyle='None')]
legend_nme = [plt.Rectangle((0, 0), 1, 1, edgecolor='black', facecolor='none', linewidth=1.2, label='NME Type')]
# Add legends to the plot
legend1 = plt.legend(handles=legend_handles_ta, title="Therapeutic Area (TA)", loc='upper left')
plt.gca().add_artist(legend1)
legend2 = plt.legend(handles=legend_external_funding, loc='upper right')
plt.gca().add_artist(legend2)
plt.legend(handles=legend_nme, loc='upper center')
plt.title('Number of Projects by Phase, Colored by TA, with Symbol on Bar Border for External Funding and Border for NME Type')
plt.xlabel('Phase')
plt.ylabel('Number of Projects')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Here is what the result looks like: [image]
In my attempts to replicate the output in R, I have tried the following code:
library(ggplot2)
library(dplyr)
portfolio_data <- read.table(text = "Project Phase Funding TA Type
Project1 I Internal Malaria NME
Project2 I Internal Malaria NME
Project3 I Internal Malaria NME
Project4 I External HIV NME
Project5 I Internal HIV NME
Project10 II Internal Malaria NME
Project11 II Internal Malaria NME
Project12 II Internal Malaria NME
Project17 II External Malaria LCM
Project18 II External HIV LCM
Project19 II Internal HIV LCM
Project20 III External Malaria NME
Project21 III Internal Malaria NME
Project22 III External Malaria LCM
Project23 III Internal HIV LCM
Project24 III External HIV NME
Project25 III Internal Malaria LCM
Project26 III External HIV LCM
Project27 III Internal HIV NME
", header=TRUE)
portfolio_data <- portfolio_data %>%
  mutate(dummy = 1)
ta_colors <- c(
  Malaria = "lightblue",
  HIV = "lightgreen"
)
type_colors <- c(
  NME = "black",
  LCM = "white"
)
# Create the plot
plot <- ggplot(portfolio_data, aes(x = Phase, y = dummy, fill = TA, label = Project)) +
  geom_col() +
  # add project name as labels
  geom_text(
    aes(label = Project),
    position = position_stack(vjust = .5)
  ) +
  # add borders by Type
  geom_col(
    aes(color = Type),
    fill = NA,
    size = 1
  ) +
  # add colors for TA and Type
  scale_fill_manual(values = ta_colors) +
  scale_color_manual(values = type_colors) +
  # diamonds for projects with external funding
  geom_text(
    aes(label = if_else(Funding == "External", "\u25C6", NA)),
    vjust = 0.5, hjust = -6.8, color = "red", size = 5,
    position = position_stack(vjust = .5)
  ) +
  # Theme and labels
  labs(
    title = "Number of Projects by Phase, Colored by TA, with Symbol on Bar Border for External Funding and Border for NME Type",
    x = "Phase",
    y = "Number of Projects"
  ) +
  theme_minimal()
print(plot)
I got the following result: [image]
The problem is that the borders are not correct. For example, Project24 is an NME project. It seems that the second geom_col() call re-orders the projects so that the link between the Project and Type is no longer maintained. Is there a way around this? I wanted to use the built-in functionality to draw borders, but maybe I should consider adding a separate layer with boxes around the labels? I also tried geom_bar(), but without success. Perhaps there are even better ways. Any help appreciated.
The main issue is the grouping. When using position_stack, the order of the stack is determined by the group aesthetic. If it is not set explicitly, ggplot2 infers the group from the categorical variables mapped to the other aesthetics; in your case the grouping is set according to fill, color and label. Moreover, each layer has its own (default) grouping: in your second geom_col you drop the grouping by fill because you set fill = NA. As a consequence, you get a different grouping, and hence a different stacking order, for this layer.
Hence, especially for complex plots like yours, which involve multiple geoms and aesthetics, the default grouping will not always give you the desired result. Instead you have to set it explicitly. In your case the stack should be ordered by Project and only by Project, i.e. add group = Project to aes().
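To make the effect of the grouping concrete, here is a toy model in Python (the language of your original plot). It is only a sketch under stated assumptions, not ggplot2's actual implementation: the helper stack_order is hypothetical, it assumes bars are stacked in order of their group key with data order preserved inside a group, and it uses TA and Type alone as the group keys of the two layers for simplicity.

```python
# Phase III rows from the question's data: (Project, TA, Type)
phase_iii = [
    ("Project20", "Malaria", "NME"),
    ("Project21", "Malaria", "NME"),
    ("Project22", "Malaria", "LCM"),
    ("Project23", "HIV", "LCM"),
    ("Project24", "HIV", "NME"),
    ("Project25", "Malaria", "LCM"),
    ("Project26", "HIV", "LCM"),
    ("Project27", "HIV", "NME"),
]


def stack_order(rows, key):
    """Hypothetical helper: return project names in stacking order.

    Assumption: bars are stacked in order of their group key, and rows
    that share a key keep their order from the data (sorted() is stable).
    """
    return [row[0] for row in sorted(rows, key=key)]


# Layer 1 (fill = TA): the group is derived from TA.
fill_layer = stack_order(phase_iii, key=lambda r: r[1])
# Layer 2 (color = Type, fill = NA): the group is derived from Type.
border_layer = stack_order(phase_iii, key=lambda r: r[2])

print("fill layer:  ", fill_layer)
print("border layer:", border_layer)

# Same slot, different project: the border drawn around Project24's
# filled bar belongs to whatever the border layer placed in that slot.
slot = fill_layer.index("Project24")
print("slot", slot, "-> fill:", fill_layer[slot], "| border:", border_layer[slot])

# With group = Project, both layers use the same key and line up again.
fixed_fill = stack_order(phase_iii, key=lambda r: r[0])
fixed_border = stack_order(phase_iii, key=lambda r: r[0])
print("aligned:", fixed_fill == fixed_border)
```

In this toy model, Project24 (NME) shares its slot with Project23 (LCM) in the border layer, which is the kind of mismatch visible in your plot. Once both layers are keyed by Project, the orders agree.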
Besides that, I made some additional tweaks. First, I reversed the order of the stacks using position_stack(..., reverse = TRUE). Second, I set the outline color to "transparent" for the "LCM" type. Third, I switched to geom_point to add the diamonds, which makes it possible to use the shape aesthetic and so get a third (shape) legend as in your Python plot. Finally, I tweaked the legends via theme() and guides().
library(ggplot2)
# portfolio_data and ta_colors are defined as in the question
# Outline colors: NME gets a black border, LCM gets none
type_colors <- c(
  NME = "black",
  LCM = "transparent"
)
# One shared position object so that every layer stacks in the same order
ps <- position_stack(vjust = .5, reverse = TRUE)
ggplot(
  portfolio_data,
  # group = Project fixes the stacking order for all layers
  aes(x = Phase, y = dummy, group = Project)
) +
  geom_col(aes(fill = TA), position = ps) +
  geom_col(aes(color = Type),
    fill = NA,
    linewidth = 1, position = ps
  ) +
  geom_text(aes(label = Project), position = ps) +
  # Diamonds near the right edge of each bar, mapped to shape to get a legend
  geom_point(
    aes(
      x = as.numeric(factor(Phase)) + .35,
      shape = Funding == "External"
    ),
    color = "red", size = 5,
    position = ps
  ) +
  # Only externally funded projects get a symbol and a legend entry
  scale_shape_manual(
    values = c(18, NA),
    labels = "External",
    breaks = "TRUE"
  ) +
  scale_fill_manual(
    values = ta_colors
  ) +
  # Show only NME in the border legend
  scale_color_manual(
    values = type_colors,
    breaks = "NME"
  ) +
  # Theme and labels
  labs(
    title = "Number of Projects by Phase, Colored by TA, with Symbol on Bar Border for External Funding and Border for NME Type",
    x = "Phase",
    y = "Number of Projects",
    shape = "Funding"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",
    legend.direction = "vertical"
  ) +
  guides(
    color = guide_legend(title.position = "top", order = 2),
    fill = guide_legend(title.position = "top", order = 1),
    shape = guide_legend(title.position = "top", order = 3)
  )