Search code examples
pythonpython-3.xpandashashtabledescribe

How pandas describe() - top works when multiple elements have highest count?


Context:

I am trying to understand how top attribute of describe() works in python (3.7.3) pandas (0.24.2).

Efforts hitherto:

  1. I looked into documentation of pandas.DataFrame.describe. It states that:

    If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

    I am trying to understand which part of code exactly attributes to the "arbitrary" output.

  2. I stepped into the code which is being called by describe in-turn. My traceback is as follows:

describe()  #pandas.core.generic
describe_1d()  #pandas.core.generic
describe_categorical_1d()  #pandas.core.generic
value_counts()  #pandas.core.base
value_counts()  #pandas.core.algorithms
_value_counts_arraylike()  #pandas.core.algorithms
# In the above step it uses hash-table, to find keys and their counts
# I am not able to step further, as further implementations are in C.

Sample Trial:

import pandas as pd
sample = pd.Series(["Down","Up","Up","Down"])
sample.describe()["top"]

The above code can give Down or Up randomly, as expected.

Question:

  • Which method in the traceback contributes to the randomness of the output?
  • Does order of keys obtained from hash-table is the reason?

    If yes,

    -- Is it not every-time, same key have same hash and be fetched in same order?

    -- How are keys hashed, iterated (for fetching all keys) and fetched from hash-table?

Any pointer is much appreciated! Thanks in advance :)


Solution

  • As pointed out above, it gives "Down" arbitrarily, but not randomly. On the same machine with the same Pandas version, running the above code should always yield the same result (although it's not guaranteed by the docs, see comments below).

    Let's reproduce what's happening.

    Given this series:

    abc = pd.Series(list("abcdefghijklmnoppqq"))
    

    The value_counts implementation boils down to this:

    import pandas._libs.hashtable as htable
    keys, counts = htable.value_count_object(np.asarray(abc), True)
    result = pd.Series(counts, index=keys)
    

    result:

    g    1
    e    1
    f    1
    h    1
    o    1
    d    1
    b    1
    q    2
    j    1
    k    1
    i    1
    p    2
    n    1
    l    1
    c    1
    m    1
    a    1
    dtype: int64
    

    The order of the result is given by the implementation of the hash table. It is the same for every call.

    You could look into the implementation of value_count_object, which calls build_count_table_object, which uses the khash implementation to get more details about the hashing.

    After computing the table, the value_counts implementation is sorting the results with quicksort. This sort is not stable and with this specially constructed example reorders "p" and "q":

    result.sort_values(ascending=False)
    
    q    2
    p    2
    a    1
    e    1
    f    1
    h    1
    o    1
    d    1
    b    1
    j    1
    m    1
    k    1
    i    1
    n    1
    l    1
    c    1
    g    1
    dtype: int64
    

    Thus there are potentially two factors for the ordering: first the hashing, and second the non-stable sort.

    The displayed top value is then just the first entry of the sorted list, in this case, "q".

    On my machine, quicksort becomes non-stable at 17 entries, this is why I chose the example above.

    We can test the non-stable sort with this direct comparison:

    pd.Series(list("abcdefghijklmnoppqq")).describe().top
    'q'
    
    pd.Series(list(               "ppqq")).describe().top
    'p'