Why I cannot replace the len
below with "count"
and keep same output behavior?. I mean, it runs without error, but the output is very different, if I change the len
to count
, I get an empty dataset.
import pandas as pd
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]}
df = pd.DataFrame(data)
pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc=len)
print(pivot_table)
Results in:
Value 10 20 30 40 50 60 70 80 90
Category
A 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN
B NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN
C NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0
size
produces same result as len:
pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc="size")
Even weirder, I can make "count" work like "size" (or len) if I do this strange workaround: create a copy of Value with a different "ValueCopy" name :
df["ValueCopy"] = df["Value"]
pivot_table = df.pivot_table(index='Category', columns='Value', values='ValueCopy', aggfunc="count")
But if I change the pivot_table
call to use "count"
without the extra dummy column workaround:
pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc="count")
it results in:
Empty DataFrame
Columns: []
Index: [A, B, C]
Questions:
"count"
work if it does not call len
? (I tried to read the pivot_table Pandas code, but I found it too convoluted)In Polars, "count" works perfectly fine as:
import polars as pl
# Create a DataFrame
data = {
'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]
}
df = pl.DataFrame(data)
# Pivot the DataFrame
pivot_table = df.pivot(index='Category', columns='Value', values='Value', aggregate_function="count")
# Print the pivot table
print(pivot_table)
results in:
Shape: (3, 10)
┌──────────┬──────┬──────┬──────┬───┬──────┬──────┬──────┬──────┐
│ Category ┆ 10 ┆ 20 ┆ 30 ┆ … ┆ 60 ┆ 70 ┆ 80 ┆ 90 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ u32 ┆ u32 ┆ ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞══════════╪══════╪══════╪══════╪═══╪══════╪══════╪══════╪══════╡
│ A ┆ 1 ┆ null ┆ null ┆ … ┆ null ┆ 1 ┆ null ┆ null │
│ B ┆ null ┆ 1 ┆ null ┆ … ┆ null ┆ null ┆ 1 ┆ null │
│ C ┆ null ┆ null ┆ 1 ┆ … ┆ 1 ┆ null ┆ null ┆ 1 │
└──────────┴──────┴──────┴──────┴───┴──────┴──────┴──────┴──────┘
Likewise DuckDb SQL pivot count has no trouble counting:
import pandas as pd
import duckdb
# Your existing DataFrame
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]}
df = pd.DataFrame(data)
# Create a DuckDB connection
con = duckdb.connect()
# Register the DataFrame with DuckDB
con.register('df_pivot', df)
# Perform the pivot operation in DuckDB
query = """
PIVOT df_pivot
ON Value
USING Count(Value)
GROUP BY Category
"""
pivot_table = con.execute(query).fetchdf()
print(pivot_table)
results in:
Category 10 20 30 40 50 60 70 80 90
0 A 1 0 0 1 0 0 1 0 0
1 B 0 1 0 0 1 0 0 1 0
2 C 0 0 1 0 0 1 0 0 1
In Pandas, pivot_table
at some point calls groupby(index+columns)
.
In your example that means grouping on the only two existing columns, so there are no values to count.
When using len
in this situation it returns the length of each group as a DataFrame (using DataFrame.__len__
, which returns len(self.index)
) -- which IMO seems more odd than returning nothing.