python excel data-analysis aggregation countif

Python data aggregation ("Countifs" function in Excel)

Long-time Excel user turned new Python user here. I have the following data frame of product IDs:

productID              sales
6976849                194,518,557             
11197085               277,387,647
70689391               197,511,925
70827164               242,995,691
70942756               1,529,319,200

(It doesn't look pretty in the interface, but in Python I've managed to get this into a dataframe with a column for ID and a column for Sales.)

Each product ID has a total number of sales.

What I need is a count of how many products have over 200,000,000 sales, as well as how many products have under 200,000,000 sales.

Bucket Total count Over 200,000,000 x Under 200,000,000 y

In Excel I would do this with a quick Countif function, but I'm not sure how that works in Python.

I'm having a surprisingly difficult time finding how to do this--can anyone point me in the right direction? Even just the names of the functions, so that I can read about them, would be useful!

Thank you!!

Solution

Use Pandas and value_counts:

import pandas as pd

df = pd.read_excel('data.xlsx')
over, under = df['sales'].gt(200000000).value_counts().tolist()

Output:

>>> over
3

>>> under
2

Step by step:

# Display your data after load file
>>> df
   productID       sales
0    6976849   194518557
1   11197085   277387647
2   70689391   197511925
3   70827164   242995691
4   70942756  1529319200

# Select the column 'sales'
>>> df['sales']
0     194518557
1     277387647
2     197511925
3     242995691
4    1529319200
Name: sales, dtype: int64

# Sales are greater than 200000000? (IF part of COUNTIF)
>>> df['sales'].gt(200000000)
0    False
1     True
2    False
3     True
4     True
Name: sales, dtype: bool

# Count True (over) and False (under) (COUNT part of COUNTIF)
>>> df['sales'].gt(200000000).value_counts()
True     3
False    2
Name: sales, dtype: int64

# Convert to list
>>> df['sales'].gt(200000000).value_counts().tolist()
[3, 2]

# Set variables over / under
>>> over, under = df['sales'].gt(200000000).value_counts().tolist()

Update

I should also add that there are 100 million rows in the dataset, and I will need more buckets, something like Over 500 million between 200 million and 500 million between 100 million and 200 million under 100 million Can you tell me how I would go about setting the buckets?

Use pd.cut and value_counts:

df['buckets'] = pd.cut(df['sales'], right=False, ordered=True,
                       bins=[0, 100e6, 200e6, 500e6, np.inf],
                       labels=['under 100M', '100-200M',
                               '200-500M', 'over 500M'])

>>> df
   productID       sales    buckets
0    6976849   194518557   100-200M
1   11197085   277387647   200-500M
2   70689391   197511925   100-200M
3   70827164   242995691   200-500M
4   70942756  1529319200  over 500M

>>> df.value_counts('buckets', sort=False)
buckets
under 100M    0
100-200M      2
200-500M      2
over 500M     1
dtype: int64