pandas, statistics, percentile

Clarification about 75th percentile: does it include the boundary?


Imagine we have some data about movies, including a column "year of release". I run the describe function on a (pandas) DataFrame; for the year column, the 75th percentile is 2020, which by definition means that 75% of the movies were released before 2020.

My question is: is 2020 included or excluded? In other words, in the 75% of the movies below the 75th percentile, is the max release_date 2019 or 2020?

PS: this is important because my 25th percentile is 2017, and I want to state that 50% of the movies were released between 2017 and ? (either 2019 or 2020).


Solution

  • It's likely in between.

    If you get exactly 2020, it probably means that at least one movie from 2020 is included and at least one is excluded (or you have no movie from 2020 at all, and the value is interpolated from the years just below and just above the threshold).

    Example:

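    import pandas as pd

    # The 0.75 quantile falls between 2019 and 2020, so it is interpolated.
    #                   bottom 75%          | top 25%
    pd.Series([2014,2015,2016,2017,2018,2019,2020,2021]).quantile(0.75)
    # 2019.25

    # Here the values on both sides of the threshold are both 2020.
    pd.Series([2014,2015,2016,2017,2018,2020,2020,2021]).quantile(0.75)
    # 2020.0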
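
If you want to know which side of the boundary your own data falls on, one option is to count the shares directly rather than infer them from the percentile. The sketch below is only an illustration: the `years` and `evenly_spaced` series are made-up sample data reusing the values above, not your real column. It also shows the `interpolation` argument of `Series.quantile`, which makes the result an actual observed value instead of a value interpolated linearly (the default).

```python
import pandas as pd

# Made-up sample data (same values as the second example above).
years = pd.Series([2014, 2015, 2016, 2017, 2018, 2020, 2020, 2021])

q75 = years.quantile(0.75)  # default interpolation is "linear"

# Share of movies strictly before vs. up to and including the 75th percentile.
share_strictly_below = (years < q75).mean()
share_at_or_below = (years <= q75).mean()
print(q75, share_strictly_below, share_at_or_below)

# With "lower" / "higher", the quantile is always one of the observed years.
evenly_spaced = pd.Series([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021])
q75_lower = evenly_spaced.quantile(0.75, interpolation="lower")
q75_higher = evenly_spaced.quantile(0.75, interpolation="higher")
print(q75_lower, q75_higher)
```

In this sample, 62.5% of the values are strictly below 2020 and 87.5% are at or below it, so neither "excluded" nor "included" gives exactly 75%. For the statement about the middle 50%, it is probably safest to compute the share with `years.between(low, high).mean()` for the bounds you intend to quote and report that number.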