Search code examples
pythonexceldata-analysisaggregationcountif

Python data aggregation ("Countifs" function in Excel)


Long-time Excel user turned new Python user here. I have the following data frame of product IDs:

productID              sales
6976849                194,518,557             
11197085               277,387,647
70689391               197,511,925
70827164               242,995,691
70942756               1,529,319,200

(It doesn't look pretty in the interface, but in Python I've managed to get this into a dataframe with a column for ID and a column for Sales.)

Each product ID has a total number of sales.

What I need is a count of how many products have over 200,000,000 sales, as well as how many products have under 200,000,000 sales.

Bucket Total count Over 200,000,000 x Under 200,000,000 y

In Excel I would do this with a quick Countif function, but I'm not sure how that works in Python.

I'm having a surprisingly difficult time finding how to do this--can anyone point me in the right direction? Even just the names of the functions, so that I can read about them, would be useful!

Thank you!!


Solution

  • Use Pandas and value_counts:

    import pandas as pd
    
    df = pd.read_excel('data.xlsx')
    over, under = df['sales'].gt(200000000).value_counts().tolist()
    

    Output:

    >>> over
    3
    
    >>> under
    2
    

    Step by step:

    # Display your data after load file
    >>> df
       productID       sales
    0    6976849   194518557
    1   11197085   277387647
    2   70689391   197511925
    3   70827164   242995691
    4   70942756  1529319200
    
    # Select the column 'sales'
    >>> df['sales']
    0     194518557
    1     277387647
    2     197511925
    3     242995691
    4    1529319200
    Name: sales, dtype: int64
    
    # Sales are greater than 200000000? (IF part of COUNTIF)
    >>> df['sales'].gt(200000000)
    0    False
    1     True
    2    False
    3     True
    4     True
    Name: sales, dtype: bool
    
    # Count True (over) and False (under) (COUNT part of COUNTIF)
    >>> df['sales'].gt(200000000).value_counts()
    True     3
    False    2
    Name: sales, dtype: int64
    
    # Convert to list
    >>> df['sales'].gt(200000000).value_counts().tolist()
    [3, 2]
    
    # Set variables over / under
    >>> over, under = df['sales'].gt(200000000).value_counts().tolist()
    

    Update

    I should also add that there are 100 million rows in the dataset, and I will need more buckets, something like Over 500 million between 200 million and 500 million between 100 million and 200 million under 100 million Can you tell me how I would go about setting the buckets?

    Use pd.cut and value_counts:

    df['buckets'] = pd.cut(df['sales'], right=False, ordered=True,
                           bins=[0, 100e6, 200e6, 500e6, np.inf],
                           labels=['under 100M', '100-200M',
                                   '200-500M', 'over 500M'])
    
    >>> df
       productID       sales    buckets
    0    6976849   194518557   100-200M
    1   11197085   277387647   200-500M
    2   70689391   197511925   100-200M
    3   70827164   242995691   200-500M
    4   70942756  1529319200  over 500M
    
    >>> df.value_counts('buckets', sort=False)
    buckets
    under 100M    0
    100-200M      2
    200-500M      2
    over 500M     1
    dtype: int64