I have a homework assignment in which I used Minitab to find the quartiles and the interquartile range of a data set. When I tried to replicate the results using NumPy, the results were different. After some googling, I see that there are many different algorithms for computing quartiles. I've tried all the interpolation types listed in the NumPy docs for the percentile function, but none of them match Minitab's algorithm. Is there a lazy way to get Minitab's algorithm with NumPy, or will I need to roll my own code and implement the algorithm?
Sample code:
import pandas as pd
import numpy as np
terrestrial = pd.Series([76.5, 6.03, 3.51, 9.96, 4.24, 7.74, 9.54, 41.7, 1.84, 2.5, 1.64])
aquatic = pd.Series([.27, .61, .54, .14, .63, .23, .56, .48, .16, .18])
# aquatic has one fewer value, so its last row in the DataFrame is NaN
df = pd.DataFrame({'terrestrial': terrestrial, 'aquatic': aquatic})
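This is the method I used with NumPy:
# Drop the NaN padding first; 'linear' is the default (newer NumPy versions call this argument `method`)
q75, q25 = np.percentile(df.aquatic.dropna(), [75, 25], interpolation='linear')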
iqr = q75 - q25
The results from Minitab are different:
Descriptive Statistics: aquatic, terrestrial
Variable Q1 Q3 IQR
aquatic 0.1750 0.5725 0.3975
terrestrial 2.50 9.96 7.46
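To make the discrepancy concrete, here is a minimal sketch that computes the aquatic quartiles by hand under two position rules. NumPy's default 'linear' rule places the p-th quantile at the 1-based position 1 + (n - 1) * p. The second rule, (n + 1) * p, is an assumption about what Minitab does; it happens to be consistent with the Minitab output above. The helper name value_at_position is made up for this illustration.

```python
import math

def value_at_position(sorted_vals, pos):
    """Linearly interpolate at a 1-based, possibly fractional position."""
    n = len(sorted_vals)
    pos = min(max(pos, 1.0), float(n))   # clamp to the valid range
    lo = int(math.floor(pos))            # 1-based index at or below pos
    hi = min(lo + 1, n)                  # neighbour above, capped at n
    frac = pos - lo
    return sorted_vals[lo - 1] + frac * (sorted_vals[hi - 1] - sorted_vals[lo - 1])

aquatic = sorted([.27, .61, .54, .14, .63, .23, .56, .48, .16, .18])
n = len(aquatic)

# NumPy's default 'linear' rule: position 1 + (n - 1) * p
q1_linear = value_at_position(aquatic, 1 + (n - 1) * 0.25)
q3_linear = value_at_position(aquatic, 1 + (n - 1) * 0.75)

# Assumed Minitab rule: position (n + 1) * p
q1_np1 = value_at_position(aquatic, (n + 1) * 0.25)
q3_np1 = value_at_position(aquatic, (n + 1) * 0.75)

print("linear rule: ", q1_linear, q3_linear, q3_linear - q1_linear)
print("(n + 1) rule:", q1_np1, q3_np1, q3_np1 - q1_np1)
```

Up to floating-point rounding, the first rule gives roughly 0.1925 and 0.555, while the second gives roughly 0.175 and 0.5725, which is what Minitab reports for aquatic.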
Here's an attempt to implement Minitab's algorithm. I've written these functions assuming that you've already dropped missing observations from the series a:
# Drop missing obs
x = df.aquatic[~ pd.isnull(df.aquatic)]
def get_quartile1(a):
    # Sort ascending (returns a new Series)
    a = a.sort_values()
    # 1-based, possibly fractional position of Q1
    pos1 = (len(a) + 1) / 4.0
    round_pos1 = int(np.floor(pos1))
    first_part = a.iloc[round_pos1 - 1]
    # Interpolate towards the next observation by the fractional part
    extra_prop = pos1 - round_pos1
    interp_part = extra_prop * (a.iloc[round_pos1] - first_part)
    return first_part + interp_part
get_quartile1(x)
Out[84]: 0.17499999999999999
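def get_quartile3(a):
    # Sort ascending (returns a new Series)
    a = a.sort_values()
    # 1-based, possibly fractional position of Q3
    pos3 = (3 * len(a) + 3) / 4.0
    # Floor rather than round, so the fractional part is never negative
    round_pos3 = int(np.floor(pos3))
    first_part = a.iloc[round_pos3 - 1]
    # Interpolate towards the next observation by the fractional part
    extra_prop = pos3 - round_pos3
    interp_part = extra_prop * (a.iloc[round_pos3] - first_part)
    return first_part + interp_part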
get_quartile3(x)
Out[86]: 0.57250000000000001
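As a possible shortcut, NumPy 1.22 and later accept a method argument in np.percentile, and method='weibull' places the quantile at position (n + 1) * p, which is the same position the functions above use. The sketch below assumes such a NumPy version is installed; the wrapper name quartiles_np1 is made up for this example, and I have only checked it against the two columns from the question, not against Minitab in general.

```python
import numpy as np

def quartiles_np1(values):
    """Return (Q1, Q3, IQR) using the (n + 1) * p position rule."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]          # drop missing observations
    # method='weibull' requires NumPy >= 1.22
    q1, q3 = np.percentile(values, [25, 75], method='weibull')
    return q1, q3, q3 - q1

aquatic = [.27, .61, .54, .14, .63, .23, .56, .48, .16, .18]
terrestrial = [76.5, 6.03, 3.51, 9.96, 4.24, 7.74, 9.54, 41.7, 1.84, 2.5, 1.64]

aq_q1, aq_q3, aq_iqr = quartiles_np1(aquatic)
te_q1, te_q3, te_iqr = quartiles_np1(terrestrial)

print("aquatic:    ", aq_q1, aq_q3, aq_iqr)
print("terrestrial:", te_q1, te_q3, te_iqr)
```

For these two columns the values agree with the Minitab table (0.1750, 0.5725, 0.3975 and 2.50, 9.96, 7.46) up to floating-point rounding. Unlike the hand-written functions, np.percentile also clamps positions that fall past the last observation, so very short series do not raise an IndexError.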