python pandas seaborn scatter-plot standard-deviation

python - scatter plot issue - not sure how to structure the plot for the results i want?


i have a dataframe of video game titles that were released across multiple platforms, along with their total sales. it looks like this:

    name                        total_sales platform
0   Frozen: Olaf's Quest            0.51    DS
1   Frozen: Olaf's Quest            0.59    3DS
2   007: Quantum of Solace          0.02    PC
3   007: Quantum of Solace          0.13    DS
4   007: Quantum of Solace          0.43    PS2
5   007: Quantum of Solace          0.65    Wii
6   007: Quantum of Solace          1.15    PS3
7   007: Quantum of Solace          1.48    X360
8   007: The World is not Enough    0.92    PS
9   007: The World is not Enough    1.56    N64
10  11eyes: CrossOver               0.02    PSP
11  18 Wheeler: American Pro Truc   0.11    GC
12  18 Wheeler: American Pro Truc   0.40    PS2
13  187: Ride or Die                0.06    XB
14  187: Ride or Die                0.15    PS2
15  2 in 1 Combo Pack: Sonic Heroes 0.11    X360
16  2 in 1 Combo Pack: Sonic Heroes 0.53    XB
17  2002 FIFA World Cup             0.05    GC
18  2002 FIFA World Cup             0.19    XB
19  2002 FIFA World Cup             0.60    PS2

i'm using the following to organize the dataframe:

df = yearly_sales.groupby(['name','total_sales']).last()
df = yearly_sales.reset_index()

then plotting it on a seaborn scatter plot:

sns.scatterplot(data=yearly_sales, x="total_sales", y="name")

now, it won't plot by name (i'm guessing because there are 7400 values), so i thought i'd try to calculate the deviation between platforms:

df.groupby(['name','platform'])['total_sales'].std()

but this mostly gives me NaN values, because few if any games are across all platforms.

i'm not sure what my next step should be. ultimately, what i want to show is how the total sales of each title differs across platforms. i'm not even totally confident that i'm approaching this the right way to begin with.

any input would be greatly appreciated!

thanks for your time in advance,

Jared


Solution

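  • I think a histplot would be a better way to visualize this if, as you say, "ultimately, what i want to show is how the total sales of each title differs across platforms". It shows how many games fall into each 0.1-wide bin of standard deviation, where the standard deviation is computed per game across its platforms. By default std() uses ddof=1, which returns NaN for any game that has only one row. You can pass ddof=0 to avoid the NaN values (those games get 0 instead), but that will also change the standard deviation of every other game.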

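    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    from matplotlib.ticker import MaxNLocator

    plt.style.use('dark_background')
    fig, ax = plt.subplots(dpi=150)
    # Standard deviation of total_sales per game, across its platforms
    # (ddof=0 gives 0 rather than NaN for single-platform games)
    df = df[['name', 'total_sales']].groupby('name', as_index=False).std(ddof=0)
    # Histogram of those deviations, with bin edges every 0.1 from 0 to 0.9
    sns.histplot(data=df, x='total_sales', kde=True, bins=np.arange(0, 1, 0.1), ax=ax)
    # Integer tick labels on the count axis
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))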
    

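To see concretely what ddof changes, here is a minimal sketch on three rows copied from the sample table in the question. The names `sales`, `sample_std` and `population_std` are made up for illustration; the only pandas call that matters is `groupby(...).std(ddof=...)`.

```python
import pandas as pd

# Three rows from the question's sample data: one game on two platforms,
# one game on a single platform.
sales = pd.DataFrame({
    "name": ["Frozen: Olaf's Quest", "Frozen: Olaf's Quest", "11eyes: CrossOver"],
    "total_sales": [0.51, 0.59, 0.02],
    "platform": ["DS", "3DS", "PSP"],
})

# Default ddof=1 (sample standard deviation): divides by n - 1,
# so a game with a single row comes out as NaN.
sample_std = sales.groupby("name")["total_sales"].std()

# ddof=0 (population standard deviation): divides by n,
# so a game with a single row comes out as 0.0.
population_std = sales.groupby("name")["total_sales"].std(ddof=0)

print(sample_std)
print(population_std)
```

Note that the two-platform game gets a different value under each setting as well, which is the trade-off mentioned above. If counting single-platform games as having zero spread would be misleading for your plot, another option is to keep the default and drop the NaN rows with .dropna() before plotting.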