Search code examples
pandascountpivot-table

Why pandas pivot_table count over a dataframe with 2 columns works with aggfunc=len, but results in empty dataset with count?


Why I cannot replace the len below with "count" and keep same output behavior?. I mean, it runs without error, but the output is very different, if I change the len to count, I get an empty dataset.

import pandas as pd

data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
        'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]}

df = pd.DataFrame(data)

pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc=len)

print(pivot_table)

Results in:

Value      10   20   30   40   50   60   70   80   90
Category                                             
A         1.0  NaN  NaN  1.0  NaN  NaN  1.0  NaN  NaN
B         NaN  1.0  NaN  NaN  1.0  NaN  NaN  1.0  NaN
C         NaN  NaN  1.0  NaN  NaN  1.0  NaN  NaN  1.0

size produces same result as len:

pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc="size")

Even weirder, I can make "count" work like "size" (or len) if I do this strange workaround: create a copy of Value with a different "ValueCopy" name :

df["ValueCopy"] = df["Value"]
pivot_table = df.pivot_table(index='Category', columns='Value', values='ValueCopy', aggfunc="count")

But if I change the pivot_table call to use "count" without the extra dummy column workaround:

pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc="count")

it results in:

Empty DataFrame
Columns: []
Index: [A, B, C]

Questions:

  • Have I hit a Pandas bug?
  • Is this expected Pandas behavior?
  • How does "count" work if it does not call len? (I tried to read the pivot_table Pandas code, but I found it too convoluted)
  • Why creating a dummy copy of the value column makes count work like size (and len) ?
  • Is the fact that count works fine in Polars & DuckDb indicative of this being a defect in Pandas? (see below)

In Polars, "count" works perfectly fine as:

import polars as pl

# Create a DataFrame
data = {
    'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]
}
df = pl.DataFrame(data)

# Pivot the DataFrame
pivot_table = df.pivot(index='Category', columns='Value', values='Value', aggregate_function="count")

# Print the pivot table
print(pivot_table)

results in:

Shape: (3, 10)
┌──────────┬──────┬──────┬──────┬───┬──────┬──────┬──────┬──────┐
│ Category ┆ 10   ┆ 20   ┆ 30   ┆ … ┆ 60   ┆ 70   ┆ 80   ┆ 90   │
│ ---      ┆ ---  ┆ ---  ┆ ---  ┆   ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ str      ┆ u32  ┆ u32  ┆ u32  ┆   ┆ u32  ┆ u32  ┆ u32  ┆ u32  │
╞══════════╪══════╪══════╪══════╪═══╪══════╪══════╪══════╪══════╡
│ A        ┆ 1    ┆ null ┆ null ┆ … ┆ null ┆ 1    ┆ null ┆ null │
│ B        ┆ null ┆ 1    ┆ null ┆ … ┆ null ┆ null ┆ 1    ┆ null │
│ C        ┆ null ┆ null ┆ 1    ┆ … ┆ 1    ┆ null ┆ null ┆ 1    │
└──────────┴──────┴──────┴──────┴───┴──────┴──────┴──────┴──────┘

Likewise DuckDb SQL pivot count has no trouble counting:

import pandas as pd
import duckdb

# Your existing DataFrame
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
        'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]}
df = pd.DataFrame(data)

# Create a DuckDB connection
con = duckdb.connect()

# Register the DataFrame with DuckDB
con.register('df_pivot', df)

# Perform the pivot operation in DuckDB
query = """

PIVOT df_pivot
ON Value
USING Count(Value)
GROUP BY Category

"""
pivot_table = con.execute(query).fetchdf()

print(pivot_table)

results in:

  Category  10  20  30  40  50  60  70  80  90
0        A   1   0   0   1   0   0   1   0   0
1        B   0   1   0   0   1   0   0   1   0
2        C   0   0   1   0   0   1   0   0   1

Solution

  • In Pandas, pivot_table at some point calls groupby(index+columns).

    In your example that means grouping on the only two existing columns, so there are no values to count.

    When using len in this situation it returns the length of each group as a DataFrame (using DataFrame.__len__, which returns len(self.index)) -- which IMO seems more odd than returning nothing.