I am trying to draw a countplot for movie ratings data, overlaid with a vertical line that represents the median. Plotting the countplot itself works:
But when I try to draw a vertical line at the column's median, it is drawn in the wrong place:
That happens because the column's median is 4, so pyplot tries to draw the line on the 4th tick (starting at 0). I could manually convert the scale, but I want to know if there is a more orthodox way of sharing the same scale between both plots. I have tried using plt.xticks(np.arange(0, 5, step=0.5)), but it didn't work. Here is the code:
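import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# df is the ratings DataFrame with a 'Latest Rating' column
fig, ax = plt.subplots(figsize=(9, 6))
plt.xticks(np.arange(0, 5, step=0.5))
sns.countplot(x=df['Latest Rating'], ax=ax).set(title="Latest Rating distribution")
plt.axvline(x=df['Latest Rating'].median(),
            color='blue',
            ls='--',
            lw=2.5)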
I have tried setting the xticks manually and using a separate axis. Both times the result was the same as the screenshot attached to the question: the second plot still used the wrong scale.
From what I can see, the x-axis that sns.countplot creates isn't actually numeric; it is categorical, with the rating values used as labels. So the x-axis really starts at 0 (the first bar) and each subsequent bar is one more: "0.5" is actually at 0 on the x-axis and "3.0" is actually at 5. Your median is 4, so axvline places the line on "2.5", because "2.5" is at position 4 in the list:
0 0.5
1 1.0
2 1.5
3 2.0
4 2.5 # <--- notice it is at position/index 4
5 3.0
6 3.5
7 4.0
8 4.5
9 5.0
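The mapping above can be reproduced without any plotting. This is only a sketch: the `ratings` sample is made up for illustration, and it assumes the categorical axis lays out the sorted unique values at positions 0, 1, 2, and so on, as described above.

```python
# Hypothetical sample in which every rating from 0.5 to 5.0 appears at least once.
ratings = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.0, 4.0, 4.5, 5.0]

# Assumed layout of the categorical axis: sorted unique values at 0, 1, 2, ...
categories = sorted(set(ratings))
position_of = {label: position for position, label in enumerate(categories)}

median = 4.0  # the median reported in the question

# Passing the raw median to axvline lands on whatever label sits at position 4.
label_at_raw_median = categories[int(median)]

# The bar labelled 4.0 actually sits at a different position.
correct_position = position_of[median]

print(label_at_raw_median, correct_position)
```

With this sample, the raw median of 4 points at the "2.5" bar, while the "4.0" bar is at position 7.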
What you can do, then, is find the index of your median value in the list of all possible rating values and place the axvline at that position instead:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'Latest Rating': [0.5, 1, 1, 1, 1, 1.5, 1.5, 2, 2, 2, 2, 2.5, 2.5, 2.5, 3, 3, 3, 3, 3, 3, 3, 3.5, 3.5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4.5, 4.5, 4.5, 4.5, 4.5, 5, 5, 5, 5, 5]})
fig, ax = plt.subplots(figsize=(9, 6))
sns.countplot(x=df['Latest Rating'], ax=ax)
# Sort the values, take the unique ones, and turn the result into a list.
# Find the median's index in that list and place the line at that position.
plt.axvline(df['Latest Rating'].sort_values().unique().tolist().index(df['Latest Rating'].median()), color='red')
# Or just make a list of the known ratings and use that instead.
# This method is better if you want to show all possible ratings even if one rating isn't
# represented in your data (e.g. if there are no "0.5" ratings); in that case, also pass
# order=rating_list to sns.countplot so the bar positions match this list.
rating_list = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
plt.axvline(rating_list.index(df['Latest Rating'].median()), color='blue', linestyle='--')
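One caveat with the `.index(...)` lookup: with an even number of rows, the median can fall between two ratings (for example 3.25), and `list.index` then raises a ValueError. A minimal sketch of one way around that is to interpolate between the neighbouring bar positions. The helper name `category_position` is hypothetical and not part of seaborn or matplotlib; it assumes the category list is sorted in ascending order and matches the order of the bars.

```python
from statistics import median


def category_position(value, categories):
    """Map a numeric value onto 0-based bar positions of a categorical axis.

    `categories` must be sorted ascending and match the bar order.
    Values between two categories are placed proportionally between them.
    """
    if not categories[0] <= value <= categories[-1]:
        raise ValueError(f"{value} lies outside the category range")
    for i in range(len(categories) - 1):
        lo, hi = categories[i], categories[i + 1]
        if lo <= value <= hi:
            return i + (value - lo) / (hi - lo)
    # Only reached when there is a single category equal to `value`.
    return float(len(categories) - 1)


rating_list = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

# Hypothetical two-row sample whose median falls between two ratings.
between = median([3.0, 3.5])
print(category_position(between, rating_list))

# A median that coincides with a rating maps to that bar's position.
print(category_position(4.0, rating_list))
```

The returned number can be passed to plt.axvline in place of the `.index(...)` result. For the two examples here it gives 5.5 (halfway between the "3.0" and "3.5" bars) and 7.0 (the "4.0" bar).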