python pandas plotly plotly-dash plotly-python

Analysis of categorical variables based on three dependent dropdowns in pandas

I have a dataframe which looks something like this:

df = pd.DataFrame ({'id': {0: 84, 1: 84, 2: 84, 3: 84, 4: 124},
               'Version': { 0: 'SemVer4', 1: 'Timestamps', 2: 'Snapshots', 3: 'Names', 4: 'Numbered Versions'},
               'server_Version': {0: 'v1', 1: 'v2', 2: 'api/v1', 3: '1.1.0', 4: 'v4'},
               'owner': {0: 'vmware', 1:'microsoft', 2: 'nasa', 3: 'swagger-API', 4:'sqaas'},
               'repo_name': {0: 'container-service-extension', 1: 'azure-rest-api-specs', 2: 'api.nasa.gov', 3: 'swagger-ui', 4: 'sqaas'},
               'filepath':{0: 'openapi.yaml', 1: 'dapper.json', 2: 'dockstore-webservice/src/main/resources/openapi3/openapi.yaml', 3: 'api/cmd/kubermatic-api/swagger.json', 4: 'cmd/spec/openapi.jsonsqaas'}})

I want to create a visualization that takes three dependent dropdowns as input: owner, repo name, and filepath. The three are hierarchical: owner is the main API name, repo names are the repositories under that owner, and filepaths are the names of the different operations within the selected repo.

My desired output for the dropdowns is like this:

I was able to create these dropdowns in Elasticsearch, but the problem comes when analysing the data, since some fields are not parsed properly in Kibana.

The visualization through which I want the dropdown filtering applied is:

fig = px.scatter(df.query("owner=='swagger-API'"), x="Year", y="Month", color="Version", text="server_Version")
fig.update_traces(textposition="bottom right")
fig.show()

What this plot shows is which API versions and server versions were used across the years and months. As you can see in the code, it is hardcoded to a single owner; I want filtering that also narrows it down to a specific repo and filepath.

I am currently using Plotly for my visualizations, and I am quite new to this library. If there is another library that can help me achieve this, do let me know. Any help or suggestions on how to proceed are highly appreciated!

Solution

Since you mentioned that you want the dropdowns to be dependent, each dropdown has to be aware of the state of the other dropdowns. This isn't possible with Plotly alone, but it is possible in Dash (plotly-dash), since callbacks are supported.

To do this, we can write an update function that takes the current selections from all three dropdowns, subsets your df by them, and then updates the options of every dropdown from that subset. I figured it would make sense to update the figure in the same function as well.
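Stripped of the Dash parts, the subsetting step can be sketched on its own. This is only an illustration: `remaining_options` is a hypothetical helper name (it is not part of pandas or Dash), and the toy frame holds just the three dropdown columns.

```python
import pandas as pd

def remaining_options(df, selections):
    """Filter df by the selected values and list what is still available per column.

    `selections` maps a column name to the list of values chosen for it.
    The name and signature are illustrative assumptions, not a library API.
    """
    mask = pd.Series(True, index=df.index)
    for col, values in selections.items():
        mask &= df[col].isin(values)  # keep rows matching every dropdown
    subset = df[mask]
    options = {col: subset[col].unique().tolist() for col in selections}
    return subset, options

# Toy data using values from the sample df
toy = pd.DataFrame({
    "owner": ["vmware", "vmware", "nasa"],
    "repo_name": ["container-service-extension", "vmware-test", "api.nasa.gov"],
    "filepath": ["openapi.yaml", "vmware-test-path", "nasa-test-path"],
})

# Only the owner is restricted; the other two pass all of their values
subset, options = remaining_options(
    toy,
    {
        "owner": ["vmware"],
        "repo_name": toy["repo_name"].unique().tolist(),
        "filepath": toy["filepath"].unique().tolist(),
    },
)
print(options["repo_name"])  # repos that belong to the selected owner
```

The returned `subset` is what you would hand to `px.scatter`, and `options` is what you would send back to the three dropdowns.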

The only tricky part is that when you make a selection from one dropdown but haven't made selections from the others, those others arrive as None, whereas if you clear the selection(s) from a dropdown it arrives as [], so your update function needs to account for both cases. And if you clear the selections from all dropdowns, you probably want all possible dropdown options to show up again; this case is also accounted for.
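Both cases can be folded into one check, since None and [] are both falsy in Python. A minimal sketch, assuming a hypothetical helper called `normalize_selection` (the name is mine, not something Dash provides):

```python
def normalize_selection(selection, all_options):
    """Treat None (never touched) and [] (cleared) as 'no filter'.

    In both cases every possible option is returned, so the dropdown
    does not restrict the dataframe at all.
    """
    if not selection:  # True for both None and []
        return list(all_options)
    return list(selection)

owners = ["vmware", "microsoft", "nasa"]

untouched = normalize_selection(None, owners)    # dropdown never used
cleared = normalize_selection([], owners)        # dropdown cleared by the user
picked = normalize_selection(["nasa"], owners)   # a real selection passes through

print(untouched, cleared, picked)
```

With this, clearing every dropdown needs no special handling: each one falls back to its full list independently.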

I also extended your sample df to include multiple rows for some of the dropdowns just to check that the dash app still works for this situation. There might be some edge cases I haven't accounted for, but for now this solution seems to be working.

import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html, Input, Output, ctx

df = pd.DataFrame ({'id': {0: 84, 1: 84, 2: 84, 3: 84, 4: 124, 5:1, 6:1},
               'Version': { 0: 'SemVer4', 1: 'Timestamps', 2: 'Snapshots', 3: 'Names', 4: 'Numbered Versions', 5: 'test', 6: 'test'},
               'server_Version': {0: 'v1', 1: 'v2', 2: 'api/v1', 3: '1.1.0', 4: 'v4', 5: 'v5', 6: 'v5'},
               'owner': {0: 'vmware', 1:'microsoft', 2: 'nasa', 3: 'swagger-API', 4:'sqaas',5:'vmware',6:'nasa'},
               'Year': {0: '2018', 1:'2020', 2:'2018', 3:'2019', 4:'2019', 5:'2021',6:'2021'},
               'Month': {0: 1, 1:6, 2:2, 3:4, 4:5, 5:5, 6:10},
               'repo_name': {0: 'container-service-extension', 1: 'azure-rest-api-specs', 2: 'api.nasa.gov', 3: 'swagger-ui', 4: 'sqaas', 5:'vmware-test',6:'nasa-test'},
               'filepath':{0: 'openapi.yaml', 1: 'dapper.json', 2: 'dockstore-webservice/src/main/resources/openapi3/openapi.yaml', 3: 'api/cmd/kubermatic-api/swagger.json', 4: 'cmd/spec/openapi.jsonsqaas', 5:'vmware-test-path',6:'nasa-test-path'}})

df = df.sort_values(by='Year')

## default is to show all data
fig = px.scatter(df, x="Year", y="Month", color="Version", text="server_Version")
fig.update_traces(textposition="bottom right")

app = Dash(__name__)

## three dependent dropdowns: owner, repo name and filepath
dropdown_selections = {
    category:df[category].unique().tolist() 
    for category in ['owner','repo_name','filepath']
}

dropdown_id_to_col_mapping = {
    'owner-dropdown':'owner',
    'repo-name-dropdown': 'repo_name',
    'filepath-dropdown': 'filepath'
}

app.layout = html.Div(
    [
        html.Div(
            children=[
                dcc.Dropdown(
                    dropdown_selections['owner'], 
                    id='owner-dropdown',
                    placeholder="Select owner",
                    style={"display": "inline-block", "width": "220px"},
                    multi=True,
                ),
                dcc.Dropdown(
                    dropdown_selections['repo_name'], 
                    id='repo-name-dropdown',
                    placeholder="Select repo name",
                    style={"display": "inline-block", "width": "220px", 'padding-left': '5px'}, 
                    multi=True
                ),
                dcc.Dropdown(
                    dropdown_selections['filepath'], 
                    id='filepath-dropdown', 
                    placeholder="Select filepath",
                    style={"display": "inline-block", "width": "220px", 'padding-left': '5px'},
                    multi=True
                )
            ],
            style={"padding": "10px", 'padding-left': '6%'},
        ),
        dcc.Graph(figure=fig, id='px-scatter-fig')
    ]
)

## callback so that a selection from one dropdown updates the others (and the figure)
@app.callback(
    Output('owner-dropdown', 'options'),
    Output('repo-name-dropdown', 'options'),
    Output('filepath-dropdown', 'options'),
    Output('px-scatter-fig', 'figure'),
    Input('owner-dropdown', 'value'),
    Input('repo-name-dropdown', 'value'),
    Input('filepath-dropdown', 'value'),
    prevent_initial_call=True
)
def update_dropdowns(owner_selection, repo_name_selection, filepath_selection):
    ## None means a dropdown was never touched and [] means it was cleared;
    ## in both cases that dropdown should not filter anything, so we fall back
    ## to every possible value for it. Conditions from the other dropdowns stay
    ## in place, and clearing ALL dropdowns therefore resets everything.
    if not owner_selection:
        owner_selection = dropdown_selections['owner']
    if not repo_name_selection:
        repo_name_selection = dropdown_selections['repo_name']
    if not filepath_selection:
        filepath_selection = dropdown_selections['filepath']
    
    ## subset the dataframe by dropdown conditions
    df_subset = df[
        df['owner'].isin(owner_selection)
        & df['repo_name'].isin(repo_name_selection)
        & df['filepath'].isin(filepath_selection)
    ]

    owner_selection = df_subset['owner'].unique().tolist()
    repo_name_selection = df_subset['repo_name'].unique().tolist()
    filepath_selection = df_subset['filepath'].unique().tolist()

    fig_update = px.scatter(df_subset, x="Year", y="Month", color="Version", text="server_Version")
    fig_update.update_traces(textposition="bottom right")

    return owner_selection, repo_name_selection, filepath_selection, fig_update

if __name__ == '__main__':
    ## on older Dash releases, use app.run_server(debug=True) instead
    app.run(debug=True)
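
One behaviour to be aware of: because every dropdown's options are recomputed from the fully filtered subset, selecting 'vmware' shrinks the owner dropdown to just 'vmware', so you can't add a second owner without clearing first. If that isn't what you want, one alternative is to compute each dropdown's options from the selections in the other dropdowns only. The following is a rough sketch of that idea, not something I have wired into the app above; `options_from_other_columns` is a hypothetical helper name.

```python
import pandas as pd

COLUMNS = ["owner", "repo_name", "filepath"]

def options_from_other_columns(df, selections):
    """For each column, list the values compatible with the OTHER columns' selections.

    `selections` maps column -> list of chosen values; a missing key, None,
    or [] means that column applies no filter.
    """
    options = {}
    for target in COLUMNS:
        mask = pd.Series(True, index=df.index)
        for col in COLUMNS:
            chosen = selections.get(col)
            if col != target and chosen:
                mask &= df[col].isin(chosen)  # filter by every column except target
        options[target] = df.loc[mask, target].unique().tolist()
    return options

# Toy data using values from the sample df
toy = pd.DataFrame({
    "owner": ["vmware", "vmware", "nasa"],
    "repo_name": ["container-service-extension", "vmware-test", "nasa-test"],
    "filepath": ["openapi.yaml", "vmware-test-path", "nasa-test-path"],
})

opts = options_from_other_columns(toy, {"owner": ["vmware"]})
print(opts["owner"])      # every owner is still selectable
print(opts["repo_name"])  # only repos under the selected owner
```

The figure would still be drawn from the subset filtered by all three selections; only the option lists change.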