Been pondering this problem all day so I thought I'd post here to get an outside opinion. This is what my source data looks like
|Author | Year | Level of Evidence | Surgery Type |
|-----------|----- | ----------------- | ------------------------- |
| John Doe | 2020 | 3 | [Laparoscopy] |
| Jane Doe | 2019 | 1 | [Laparoscopy, Cystoscopy] |
| James Doe | 2011 | 1 | [Laparoscopy, Endoscopy] |
I want to plot a histogram of the counts of Level of Evidence for each surgery. The fact that there can be multiple surgeries per Author is what is tripping me up. It's fairly simple to write some code to count the distributions of each and reformat it to look:
|Level of Evidence| Surgery Type | Count |
|-----------------| ------------ | ------- |
| 1 |Laparoscopy | 2 |
| 1 |Cystoscopy | 1 |
| 1 |Endoscopy | 1 |
| 3 |Laparoscopy | 1 |
I can then easily call plotly.express.bar(df, x = 'Surgery Type', y = 'Count')
And get the exact bar chart that I am looking for. Here is where my problem lies. I want to be able to filter the data by Year, which I can't do when I count the frequencies on my own. Is there a way to use plotly.express.histogram
function to treat [Laparoscopy, Cystoscopy] as two separate counts for the same Author? I know I can manually split the data like this in advance:
|Author | Year | Level of Evidence | Surgery Type |
|------------| ----- | ----------------- | ------------- |
| John Doe |2020 | 3 | [Laparoscopy] |
| Jane Doe |2019 | 1 | [Laparoscopy] |
| Jane Doe |2019 | 1 | [Cystoscopy] |
| James Doe |2011 | 1 | [Laparoscopy] |
| James Doe |2011 | 1 | [Endoscopy] |
However, I want to know if I can plot this straight from Plotly without having to manually split the data by Surgery Type prior. It looks to me like their sample datasets from the docs don't have arrays, and would instead be split manually prior to plotting. Is this the case? Alternatively, is there a function that can easily perform this type of split?
I appreciate your help!
First I will try to solve your problem on the following data:
data = {"Author":['John Doe','John Doe','James Doe'], "Year":[2020, 2019, 2011], "Level":[3,1,1],
"Type":[['Laparoscopy','Laparoscopy'],['Laparoscopy', 'Cystoscopy','Laparoscopy'],['Laparoscopy', 'Endoscopy']] }
df = pd.DataFrame.from_dict(data,orient='index').transpose()
Author Year Level Type
0 John Doe 2020 3 [Laparoscopy, Laparoscopy]
1 John Doe 2019 1 [Laparoscopy, Cystoscopy, Laparoscopy]
2 James Doe 2011 1 [Laparoscopy, Endoscopy]
Now, try to split the data automatically by explode
:
final_df = df.explode(['Type'])
Author Year Level Type
0 John Doe 2020 3 Laparoscopy
0 John Doe 2020 3 Laparoscopy
1 John Doe 2019 1 Laparoscopy
1 John Doe 2019 1 Cystoscopy
1 John Doe 2019 1 Laparoscopy
2 James Doe 2011 1 Laparoscopy
2 James Doe 2011 1 Endoscopy
Then group data based on all columns and get the size of each group:
final_df.groupby(df.columns.to_list()).size().to_frame('count').reset_index()
Author Year Level Type count
0 James Doe 2011 1 Endoscopy 1
1 James Doe 2011 1 Laparoscopy 1
2 John Doe 2019 1 Cystoscopy 1
3 John Doe 2019 1 Laparoscopy 2
4 John Doe 2020 3 Laparoscopy 2
As you can see, we treat each surgery type as two separate counts for the same author. I hope I can answer your question.