Search code examples
pythondataframeplotlyplotly-pythondata-wrangling

Histogram of values that are inside an Array in Plotly


Been pondering this problem all day so I thought I'd post here to get an outside opinion. This is what my source data looks like

|Author     | Year | Level of Evidence | Surgery Type              |
|-----------|----- | ----------------- | ------------------------- |
| John Doe  | 2020 | 3                 | [Laparoscopy]             |
| Jane Doe  | 2019 | 1                 | [Laparoscopy, Cystoscopy] |
| James Doe | 2011 | 1                 | [Laparoscopy, Endoscopy]  |

I want to plot a histogram of the counts of Level of Evidence for each surgery. The fact that there can be multiple surgeries per Author is what is tripping me up. It's fairly simple to write some code to count the distributions of each and reformat it to look:

|Level of Evidence| Surgery Type | Count   |
|-----------------| ------------ | ------- |
|     1           |Laparoscopy   | 2       | 
|     1           |Cystoscopy    | 1       | 
|     1           |Endoscopy     | 1       | 
|     3           |Laparoscopy   | 1       | 

I can then easily call plotly.express.bar(df, x = 'Surgery Type', y = 'Count')

And get the exact bar chart that I am looking for. Here is where my problem lies. I want to be able to filter the data by Year, which I can't do when I count the frequencies on my own. Is there a way to use plotly.express.histogram function to treat [Laparoscopy, Cystoscopy] as two separate counts for the same Author? I know I can manually split the data like this in advance:

|Author      | Year  | Level of Evidence | Surgery Type  |
|------------| ----- | ----------------- | ------------- |
| John Doe   |2020   |      3            | [Laparoscopy] |
| Jane Doe   |2019   |      1            | [Laparoscopy] |
| Jane Doe   |2019   |      1            | [Cystoscopy]  |
| James Doe  |2011   |      1            | [Laparoscopy] |
| James Doe  |2011   |      1            | [Endoscopy]   |

However, I want to know if I can plot this straight from Plotly without having to manually split the data by Surgery Type prior. It looks to me like their sample datasets from the docs don't have arrays, and would instead be split manually prior to plotting. Is this the case? Alternatively, is there a function that can easily perform this type of split?

I appreciate your help!


Solution

  • First I will try to solve your problem on the following data:

    data = {"Author":['John Doe','John Doe','James Doe'], "Year":[2020, 2019, 2011], "Level":[3,1,1], 
            "Type":[['Laparoscopy','Laparoscopy'],['Laparoscopy', 'Cystoscopy','Laparoscopy'],['Laparoscopy', 'Endoscopy']] }
    
    df = pd.DataFrame.from_dict(data,orient='index').transpose()
    
         Author     Year  Level           Type
    0   John Doe    2020    3   [Laparoscopy, Laparoscopy]
    1   John Doe    2019    1   [Laparoscopy, Cystoscopy, Laparoscopy]
    2   James Doe   2011    1   [Laparoscopy, Endoscopy]
    

    Now, try to split the data automatically by explode:

    final_df = df.explode(['Type'])
    
         Author     Year  Level    Type
    0   John Doe    2020    3   Laparoscopy
    0   John Doe    2020    3   Laparoscopy
    1   John Doe    2019    1   Laparoscopy
    1   John Doe    2019    1   Cystoscopy
    1   John Doe    2019    1   Laparoscopy
    2   James Doe   2011    1   Laparoscopy
    2   James Doe   2011    1   Endoscopy
    

    Then group data based on all columns and get the size of each group:

    final_df.groupby(df.columns.to_list()).size().to_frame('count').reset_index()
    
         Author     Year  Level   Type    count
    0   James Doe   2011    1   Endoscopy   1
    1   James Doe   2011    1   Laparoscopy 1
    2   John Doe    2019    1   Cystoscopy  1
    3   John Doe    2019    1   Laparoscopy 2
    4   John Doe    2020    3   Laparoscopy 2
    

    As you can see, we treat each surgery type as two separate counts for the same author. I hope I can answer your question.