Tags: python-3.x, pandas, matplotlib, plot, plotly

Plotting 3D plot with cross-correlation and different marker sizes to show N


I have a dataframe like this:


import pandas as pd

data = [[1, 2, 3], [2, 2, 3], [1, 1, 2], [2, 2, 2], [2, 3, 2], [2, 3, 3], [1, 1, 1], [3, 3, 3], [3, 3, 2], [1, 2, 1], [1, 3, 1], [3, 1, 3], [2, 1, 2], [3, 3, 3], [2, 2, 2], [3, 2, 1], [3, 2, 2], [2, 2, 1], [1, 1, 3], [1, 3, 2], [1, 2, 3]]
df = pd.DataFrame(data, columns=['Math', 'Science', 'English'])

Each row represents a student. Ratings run from 1 to 3: 1 is a poor grade, 2 is an average grade, and 3 is a good grade.

I'd like to create a 3D plot that shows the correlation between a student's grades and also shows N (the number of students with each combination of grades).

I'm not sure whether the subjects or the grades should go on the x, y, and z axes (I'd like to try both). Instead of a plain 3D scatter plot, I'd like markers that fall in the same place (for example, students with a 3 (good grade) in all three subjects) to be drawn bigger to show N. Essentially, I'd like the marker size to show N, and either the grade or the subject (or both) to be colour-coded.

Is there a way to visualize this type of data clearly? I was thinking of something like a 3D Venn diagram, but I can't figure it out. I tried multi-hierarchy circlify but couldn't get a 3D view that cross-correlates both grades and subjects. Any guidance would be much appreciated!

Edit: the N values for this sample dataframe were originally attached as two images.
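The N values can be recomputed directly from the sample data. The following is a minimal sketch, assuming N means the number of students who share the same (Math, Science, English) combination; the names `n_values` and `N` are chosen here for illustration.

```python
import pandas as pd

data = [[1, 2, 3], [2, 2, 3], [1, 1, 2], [2, 2, 2], [2, 3, 2], [2, 3, 3],
        [1, 1, 1], [3, 3, 3], [3, 3, 2], [1, 2, 1], [1, 3, 1], [3, 1, 3],
        [2, 1, 2], [3, 3, 3], [2, 2, 2], [3, 2, 1], [3, 2, 2], [2, 2, 1],
        [1, 1, 3], [1, 3, 2], [1, 2, 3]]
df = pd.DataFrame(data, columns=["Math", "Science", "English"])

# N = number of students sharing each (Math, Science, English) combination
n_values = (
    df.groupby(["Math", "Science", "English"])
      .size()
      .rename("N")
      .reset_index()
)

# Show the most frequent combinations first
print(n_values.sort_values("N", ascending=False).to_string(index=False))
```

The resulting `N` column is the kind of value that could be passed as the marker size in a 3D scatter plot.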


Solution

  • I created some sample data in the same format as yours: 200 students with ratings of 1-3 occurring pseudo-randomly (but with a higher proportion of 2's and 3's so that we can see a pattern).

    Additional edits: (1) you can use a mapping to replace the ratings 1, 2, 3 with "poor", "average", "good" in the df itself; plotly will then treat the data as categorical and label the axes of the 3D scatter accordingly. (2) To color the markers consistently, we need to combine the ratings from the subjects. The most straightforward way is to create a new column "sum" that holds the sum of the ratings from all three subjects, and pass the name of this column to the color argument of px.scatter_3d.

    You can also remove the count information from the hover text by setting the hovertemplate: fig.update_traces(hovertemplate="Math=%{x}<br>Science=%{y}<br>English=%{z}<br>sum=%{marker.color}<extra></extra>")

    import numpy as np
    import pandas as pd
    import plotly.express as px
    
    ## create some random data where there will be clusters
    np.random.seed(42)
    data = np.random.choice([1,2,3],size=[200,3], p=[0.2,0.3,0.5])
    df = pd.DataFrame(data, columns=['Math', 'Science', 'English'])
    
    rating_map = {1: 'poor', 2: 'average', 3:'good'}
    
    ## count the number of times each unique combination of grades occurs
    df_counts = df.value_counts().rename('counts').reset_index()
    df_counts['sum'] = df_counts['Math'] + df_counts['Science'] + df_counts['English']
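    ## replace the numeric ratings with labels, column by column
    ## (DataFrame.applymap is deprecated in pandas 2.1 and later)
    subjects = ['Math', 'Science', 'English']
    df_counts[subjects] = df_counts[subjects].apply(lambda col: col.map(rating_map))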
    
    fig = px.scatter_3d(df_counts, x='Math', y='Science', z='English', size='counts', size_max=50, color='sum')
    fig.update_layout(coloraxis_colorbar=dict(
        title="Combined Rating",
        tickvals=[3,6,9],
        ticktext=["Poor", "Average", "Good"],
    ))
    
    fig.show()
    
