Tags: python, python-2.7, pandas, numpy, cdf

CDF x value at 50% and mean don't show the same number


I have a dataframe, and I created a CDF of the days column:

...
#create DF from SQL
df = pd.read_sql_query(query, conn)

days = df['days'].dropna()

#create CDF definition
def ecdf(data):
    n = len(data)
    x = np.sort(data)
    y = np.arange(1.0, n+1) / n
    return x, y

#unpack x and y
x, y = ecdf(days)
sns.set()

#plot CDF
plt.plot(x, y, marker='.', linestyle='none')

#Overlay quartiles
percentiles = np.array([25, 50, 75])
x_p = np.percentile(days, percentiles)
y_p = percentiles / 100.0
plt.plot(x_p, y_p, marker='D', color='red', linestyle='none')

#get current axes and annotate each quartile point
#(loop names differ from x, y so the ECDF arrays are not overwritten)
ax = plt.gca()
for xq, yq in zip(x_p, y_p):
    ax.annotate('%s' % xq, xy=(xq, yq), xytext=(15, 0), textcoords='offset points')

At the 50% mark, the overlaid point on the CDF shows 120, but print(np.mean(df['days_to_engaged'])) gives 154.

Why the discrepancy?

print(df['days'].dropna()):

389
350
130
344
392
92
51
28
309
357
64
380
332
109
284
105
50
66
156
116
75
315
155
34
155
241
320
50
97
41
274
99
133
95
306
62
187
56
110
338
102
285
386
231
238
145
216
148
105
368
176
155
106
107
36
16
28
6
322
95
122
82
64
35
72
214
192
91
117
277
101
159
96
325
79
154
314
142
147
138
48
50
178
146
224
282
141
75
151
93
135
82
125
111
49
113
165
19
118
105
92
133
77
54
72
34

Solution

  • You're comparing the median to the mean. This boils down to the following:

    a = np.array([1, 1, 2, 4])
    

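    At 50%, the ECDF is just the second sorted element (1), while the mean is (4 + 2 + 1 + 1) / 4 == 2. The 50% point is the median, which depends only on the rank of the values, so large values pull the mean up without moving it.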
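To make the comparison concrete, here is a minimal sketch that reuses the ecdf function from the question on the four-element array above. The second array, skewed, is made-up illustrative data (not the asker's), chosen only to show how a single large value moves the mean but not the median.

```python
import numpy as np

def ecdf(data):
    """Sorted values and cumulative proportions, as defined in the question."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1.0, n + 1) / n
    return x, y

a = np.array([1, 1, 2, 4])
x, y = ecdf(a)

# First sorted value at which the ECDF reaches 0.5
ecdf_at_half = x[np.searchsorted(y, 0.5)]

# np.percentile interpolates linearly between the two middle values by default
median = np.percentile(a, 50)
mean = np.mean(a)

print(ecdf_at_half)  # 1
print(median)        # 1.5
print(mean)          # 2.0

# Hypothetical right-skewed sample: one large value drags the mean upward
skewed = np.array([10, 20, 30, 40, 500])
print(np.median(skewed))  # 30.0
print(np.mean(skewed))    # 120.0
```

Note that for an even number of points np.percentile returns an interpolated value (1.5 here) rather than an actual data point, so the red marker can sit slightly off the ECDF dots. Either way it marks the median, not the mean.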