Long-time Excel user turned new Python user here. I have the following data frame of product IDs:
productID sales
6976849 194,518,557
11197085 277,387,647
70689391 197,511,925
70827164 242,995,691
70942756 1,529,319,200
(It doesn't look pretty in the interface, but in Python I've managed to get this into a dataframe with a column for ID and a column for Sales.)
Each product ID has a total number of sales.
What I need is a count of how many products have over 200,000,000 sales, as well as how many products have under 200,000,000 sales.
Bucket Total count Over 200,000,000 x Under 200,000,000 y
In Excel I would do this with a quick Countif function, but I'm not sure how that works in Python.
I'm having a surprisingly difficult time finding how to do this--can anyone point me in the right direction? Even just the names of the functions, so that I can read about them, would be useful!
Thank you!!
Use Pandas
and value_counts
:
import pandas as pd
df = pd.read_excel('data.xlsx')
over, under = df['sales'].gt(200000000).value_counts().tolist()
Output:
>>> over
3
>>> under
2
Step by step:
# Display your data after load file
>>> df
productID sales
0 6976849 194518557
1 11197085 277387647
2 70689391 197511925
3 70827164 242995691
4 70942756 1529319200
# Select the column 'sales'
>>> df['sales']
0 194518557
1 277387647
2 197511925
3 242995691
4 1529319200
Name: sales, dtype: int64
# Sales are greater than 200000000? (IF part of COUNTIF)
>>> df['sales'].gt(200000000)
0 False
1 True
2 False
3 True
4 True
Name: sales, dtype: bool
# Count True (over) and False (under) (COUNT part of COUNTIF)
>>> df['sales'].gt(200000000).value_counts()
True 3
False 2
Name: sales, dtype: int64
# Convert to list
>>> df['sales'].gt(200000000).value_counts().tolist()
[3, 2]
# Set variables over / under
>>> over, under = df['sales'].gt(200000000).value_counts().tolist()
Update
I should also add that there are 100 million rows in the dataset, and I will need more buckets, something like Over 500 million between 200 million and 500 million between 100 million and 200 million under 100 million Can you tell me how I would go about setting the buckets?
Use pd.cut
and value_counts
:
df['buckets'] = pd.cut(df['sales'], right=False, ordered=True,
bins=[0, 100e6, 200e6, 500e6, np.inf],
labels=['under 100M', '100-200M',
'200-500M', 'over 500M'])
>>> df
productID sales buckets
0 6976849 194518557 100-200M
1 11197085 277387647 200-500M
2 70689391 197511925 100-200M
3 70827164 242995691 200-500M
4 70942756 1529319200 over 500M
>>> df.value_counts('buckets', sort=False)
buckets
under 100M 0
100-200M 2
200-500M 2
over 500M 1
dtype: int64