python-3.x pandas matplotlib plot plotly

Plotting 3D plot with cross-correlation and different marker sizes to show N

I have a dataframe like this:

import pandas as pd

data = [[1, 2, 3], [2, 2, 3], [1, 1, 2], [2, 2, 2], [2, 3, 2], [2, 3, 3], [1, 1, 1], [3, 3, 3], [3, 3, 2], [1, 2, 1], [1, 3, 1], [3, 1, 3], [2, 1, 2], [3, 3, 3], [2, 2, 2], [3, 2, 1], [3, 2, 2], [2, 2, 1], [1, 1, 3], [1, 3, 2], [1, 2, 3]]
df = pd.DataFrame(data, columns=['Math', 'Science', 'English'])

Each row represents a student. The ratings 1-3 mean the following: 1 is a poor grade, 2 is an average grade, and 3 is a good grade.

I'd like to create a 3D plot that shows the correlation between a student's grades and also shows N (the number of students with each combination of grades).

I'm not sure whether the subjects or the grades should go on the x, y and z axes (I'd like to try both). Instead of a plain 3D scatter plot, I'd like markers that fall in the same place (for example, students who have a 3 (good grade) in all 3 subjects) to be bigger to show the N. Essentially, I'd like the size of the marker to show the N, and either the grade or the subject (or both) to be colour-coded.

Is there a way to visualize this type of data clearly? I was thinking of something like a 3D Venn diagram, but I cannot figure it out. I tried using multi-hierarchy circlify but couldn't achieve a 3D aspect that cross-correlates both grades and subjects. Any guidance would be much appreciated!

edit: N values for this sample dataframe:
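
For reference, these counts can be reproduced directly from the dataframe. This is a minimal sketch that assumes only pandas; the names `subjects`, `n_values` and the column label `N` are chosen here for illustration.

```python
import pandas as pd

data = [[1, 2, 3], [2, 2, 3], [1, 1, 2], [2, 2, 2], [2, 3, 2], [2, 3, 3],
        [1, 1, 1], [3, 3, 3], [3, 3, 2], [1, 2, 1], [1, 3, 1], [3, 1, 3],
        [2, 1, 2], [3, 3, 3], [2, 2, 2], [3, 2, 1], [3, 2, 2], [2, 2, 1],
        [1, 1, 3], [1, 3, 2], [1, 2, 3]]
subjects = ['Math', 'Science', 'English']
df = pd.DataFrame(data, columns=subjects)

# N = number of students sharing each (Math, Science, English) combination
n_values = df.groupby(subjects).size().reset_index(name='N')
print(n_values)
```

For this sample, 18 distinct grade combinations appear among the 21 students; three of them, (1, 2, 3), (2, 2, 2) and (3, 3, 3), occur twice and the rest occur once.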

Solution

I created some sample data in the same format as yours: 200 students with ratings of 1-3 drawn pseudo-randomly (but with a higher proportion of 2s and 3s so that we can see a pattern).

Additional edits: (1) You can use a mapping to replace the ratings 1, 2, 3 with "poor", "average", "good" in the df itself; plotly then treats the data as categorical and labels the axes of the 3D scatter accordingly. (2) To color the markers in a consistent way, we need to combine the ratings from the subjects. The most straightforward way is to create a new column "sum" that holds the sum of the ratings from all three subjects, and pass the name of this column as the color argument of px.scatter_3d.

You can also remove the count from the hover text by overriding the hovertemplate: fig.update_traces(hovertemplate="Math=%{x}<br>Science=%{y}<br>English=%{z}<br>sum=%{marker.color}<extra></extra>")

import numpy as np
import pandas as pd
import plotly.express as px

## create some random data where there will be clusters
np.random.seed(42)
data = np.random.choice([1,2,3],size=[200,3], p=[0.2,0.3,0.5])
df = pd.DataFrame(data, columns=['Math', 'Science', 'English'])

rating_map = {1: 'poor', 2: 'average', 3:'good'}

## count the number of times each unique combination of grades occurs
df_counts = df.value_counts().rename('counts').reset_index()
df_counts['sum'] = df_counts['Math'] + df_counts['Science'] + df_counts['English']
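## replace the numeric ratings with labels so plotly treats the axes as categorical
## (column-wise Series.map is used because DataFrame.applymap is deprecated since pandas 2.1)
df_counts[['Math','Science','English']] = df_counts[['Math','Science','English']].apply(lambda col: col.map(rating_map))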

fig = px.scatter_3d(df_counts, x='Math', y='Science', z='English', size='counts', size_max=50, color='sum')
fig.update_layout(coloraxis_colorbar=dict(
    title="Combined Rating",
    tickvals=[3,6,9],
    ticktext=["Poor", "Average", "Good"],
))

fig.show()