Plotly Dash scatter plot: pointNumber is assigned to multiple points in hover data

I ran into an issue when using Plotly and Dash to retrieve hover data from a scatter plot. The hover data returned by the Dash app contains the same pointNumber and pointIndex for multiple points in the same plot. This makes it impossible to display the correct information associated with a given instance when hovering over the respective point.

Here is a simplified example which can be run in a Jupyter notebook. Eventually I want to display images on hover.

from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
from jupyter_dash import JupyterDash
from dash import dcc, html, Input, Output, no_update
import plotly.express as px

# Loading iris data to pandas dataframe
data = load_iris()
images = data.data
labels = data.target

df = pd.DataFrame(images[:, :2], columns=["feat1", "feat2"])
df["label"] = labels

# Color for each class
color_map = {0: "setosa",
             1: "versicolor",
             2: "virginica"}

colors = [color_map[l] for l in labels]

df["color"] = colors

pd.set_option("display.max_rows", None, "display.max_columns", None)
print(df)

# Setup plotly scatter plot
fig = px.scatter(df, x="feat1", y="feat2", color="color")
fig.update_traces(hoverinfo="none",
                  hovertemplate=None)

# Setup Dash
app = JupyterDash(__name__)
app.layout = html.Div(className="container",
                      children=[dcc.Graph(id="graph-5", figure=fig, clear_on_unhover=True),
                                dcc.Tooltip(id="graph-tooltip-5", direction="bottom")])

@app.callback(Output("graph-tooltip-5", "show"),
              Output("graph-tooltip-5", "bbox"),
              Output("graph-tooltip-5", "children"),
              Input("graph-5", "hoverData"))
def display_hover(hoverData):
    if hoverData is None:
        return False, no_update, no_update
    
    print(hoverData)

    hover_data = hoverData["points"][0]
    bbox = hover_data["bbox"]
    num = hover_data["pointNumber"]
    
    children = [html.Div([html.Img(style={"height": "50px", 
                                          "width": "50px", 
                                          "display": "block", 
                                          "margin": "0 auto"}),
                                   html.P("Feat1: {}".format(str(df.loc[num]["feat1"]))),
                                   html.P("Feat2: {}".format(str(df.loc[num]["feat2"])))])]

    return True, bbox, children

if __name__ == "__main__":
    app.run_server(mode="inline", debug=True)

The problem can be observed for example with the following two instances retrieved via print(df):

index feat1 feat2 label color
31 5.4 3.4 0 setosa
131 7.9 3.8 2 virginica

Both are assigned the same pointNumber and pointIndex, as shown by print(hoverData):

{'points': [{'curveNumber': 2, 'pointNumber': 31, 'pointIndex': 31, 'x': 7.9, 'y': 3.8, 'bbox': {'x0': 1235.5, 'x1': 1241.5, 'y0': 152.13, 'y1': 158.13}}]}

{'points': [{'curveNumber': 0, 'pointNumber': 31, 'pointIndex': 31, 'x': 5.4, 'y': 3.4, 'bbox': {'x0': 481.33, 'x1': 487.33, 'y0': 197.38, 'y1': 203.38}}]}

This is the visualization when hovering over the two instances. The hovering information is wrong for the image on the right side.

Interestingly, the issue resolves when using

fig = px.scatter(df, x="feat1", y="feat2", color="label")

However, this causes the legend to be displayed as a continuous color scale and removes the ability to selectively show or hide the instances of a specific class in the rendered plot.

Is this a bug or am I overlooking something? Any help is much appreciated!

Solution

It turned out that I incorrectly expected pointNumber and pointIndex to be unique across the whole plot. When a non-numeric column is passed as the color parameter of px.scatter(), each class is drawn as a separate trace, and the point numbers and indices restart from 0 within each trace. Points in the scatter plot can be uniquely identified by combining curveNumber with either pointNumber or pointIndex.
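To make the renumbering concrete, here is a small pure-Python sketch that mimics the numbering seen in the hover data from the question. It is not Plotly's implementation. It only assumes that one trace is created per distinct color value, in order of first appearance, and that points are counted from zero within each trace. The function name `simulate_point_ids` is made up for this illustration.

```python
def simulate_point_ids(colors):
    """Return a (curveNumber, pointNumber) pair for every row.

    Assumption (not taken from Plotly's source): one trace per distinct
    color value, ordered by first appearance, with points counted from
    zero inside each trace.
    """
    curve_of = {}    # color value -> curveNumber
    next_point = {}  # color value -> next free pointNumber in that trace
    ids = []
    for color in colors:
        if color not in curve_of:
            curve_of[color] = len(curve_of)
            next_point[color] = 0
        ids.append((curve_of[color], next_point[color]))
        next_point[color] += 1
    return ids


# Same layout as the iris data: 50 rows per class, sorted by class
colors = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
ids = simulate_point_ids(colors)

print(ids[31])   # row 31 (setosa)     -> (0, 31)
print(ids[131])  # row 131 (virginica) -> (2, 31)
```

Both rows share pointNumber 31, but the pair (curveNumber, pointNumber) is different for every row.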

A potential solution is to generate separate indices for each class and add them to the dataframe:

# Running index within each class; assumes the rows of df are sorted by label (true for the iris data)
class_counts = np.unique(labels, return_counts=True)[1]
df["curve_index"] = np.concatenate([np.arange(n) for n in class_counts])

In the callback function, the dataframe index of the hovered instance can then be looked up from curveNumber and pointNumber:

curve = hover_data["curveNumber"]
df_index = df[(df.label == curve) & (df.curve_index == num)].index[0]
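If the dataframe is not sorted by class, one possible alternative is to build a lookup table from (curveNumber, pointNumber) to the row position once and consult it in the callback. The sketch below is plain Python and assumes that traces are created per distinct color value in order of first appearance, which matches the hover data shown in the question but is an assumption, not documented behavior quoted from Plotly. `build_lookup` and `row_from_hover` are hypothetical helper names.

```python
def build_lookup(colors):
    """Map (curveNumber, pointNumber) -> row position in `colors`.

    Assumption: one trace per distinct color value, ordered by first
    appearance; points are numbered from zero within each trace.
    """
    curve_of = {}  # color value -> curveNumber
    counts = {}    # color value -> points seen so far in that trace
    lookup = {}
    for row, color in enumerate(colors):
        curve = curve_of.setdefault(color, len(curve_of))
        point = counts.get(color, 0)
        lookup[(curve, point)] = row
        counts[color] = point + 1
    return lookup


def row_from_hover(hover_data, lookup):
    """Resolve a hoverData payload to a row position."""
    point = hover_data["points"][0]
    return lookup[(point["curveNumber"], point["pointNumber"])]


# Unsorted toy data to show that the row order does not matter
colors = ["setosa", "virginica", "setosa", "versicolor", "virginica"]
lookup = build_lookup(colors)

hover = {"points": [{"curveNumber": 1, "pointNumber": 1}]}
print(row_from_hover(hover, lookup))  # second virginica row -> 4
```

In the app from the question, the table would be built once with build_lookup(df["color"]) outside the callback, and the callback would use the returned row position in place of num.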