Why does the output of Pandas DataFrame.value_counts differ from Series.value_counts?


While I was teaching, one of my students pointed out that Pandas DataFrame.value_counts returns a different ordering (different tie breaks) from the equivalent Series.value_counts. Consider this:

>>> import pandas as pd
>>> df = pd.read_csv(
...     'https://gist.githubusercontent.com/matthew-brett/806a356bb7b71f08c5c6d0c5235e2f3d/raw/facb1aab243a33033b46657378f65dcd41542596/business.csv'
... )
>>> df['name'].value_counts().head(6)
name
Peet's Coffee & Tea    20
Starbucks Coffee       13
McDonald's             10
Jamba Juice            10
STARBUCKS               9
Proper Food             9
Name: count, dtype: int64
>>> df.value_counts('name').head(6)
name
Peet's Coffee & Tea    20
Starbucks Coffee       13
McDonald's             10
Jamba Juice            10
Proper Food             9
STARBUCKS               9
Name: count, dtype: int64

Of course, both of these orders are valid, given that the default sort (quicksort) is not stable, but it's difficult to see why they would differ, given that the default sort method appears to be the same in both cases.


Solution

  • The results differ because the two methods count the values in different ways.

    To compute the value_counts of a DataFrame, pandas uses a groupby followed by size(), and groupby sorts the group keys lexicographically by default (sort=True).

    A Series computes value_counts more directly: it uses IndexOpsMixin.value_counts, which calls pandas.core.algorithms.value_counts_internal. The counts therefore reach the final sort in a different starting order, so ties can come out differently.

    So, to get the same result as a Series, use:

    >>> df.groupby('name', sort=False).size().sort_values(ascending=False).head(10)
    name
    Peet's Coffee & Tea          20
    Starbucks Coffee             13
    McDonald's                   10
    Jamba Juice                  10
    STARBUCKS                     9
    Proper Food                   9
    Mixt Greens/Mixt              8
    Specialty's Cafe & Bakery     8
    Philz Coffee                  7
    The Organic Coup              7
    dtype: int64
    
    >>> df['name'].value_counts().head(10)
    name
    Peet's Coffee & Tea          20
    Starbucks Coffee             13
    McDonald's                   10
    Jamba Juice                  10
    STARBUCKS                     9
    Proper Food                   9
    Mixt Greens/Mixt              8
    Specialty's Cafe & Bakery     8
    Philz Coffee                  7
    The Organic Coup              7
    Name: count, dtype: int64
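
    The role of the starting order can be seen without pandas. The sketch below is a simplified pure-Python model, not pandas' actual implementation: the names list is made-up toy data, the helper functions are hypothetical, and a stable sort is used on purpose so that the only difference between the two paths is the order in which the groups arrive.

```python
from collections import Counter

# Toy data (hypothetical, not rows from business.csv): two names tie on 2.
names = [
    "STARBUCKS", "Proper Food", "Peet's Coffee & Tea",
    "Peet's Coffee & Tea", "STARBUCKS", "Peet's Coffee & Tea",
    "Proper Food",
]


def counts_first_seen(values):
    """Model of the Series path: groups in order of first appearance."""
    # Counter is a dict subclass, so it keeps insertion order.
    return list(Counter(values).items())


def counts_sorted_keys(values):
    """Model of the groupby(sort=True) path: groups in sorted key order."""
    return sorted(Counter(values).items())


def by_count_desc(pairs):
    """Stable sort by descending count; ties keep their incoming order."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


series_like = by_count_desc(counts_first_seen(names))
frame_like = by_count_desc(counts_sorted_keys(names))

print(series_like)  # tie order follows first appearance
print(frame_like)   # tie order follows sorted key order
```

    Both lists hold the same counts; only the order of the tied names differs. In pandas itself the default quicksort may reorder ties further, so neither order is guaranteed; passing kind='stable' to sort_values asks for a stable sort if a reproducible tie order matters.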