Search code examples
pythonsqlpandasdataframewindow-functions

How to write SQL window functions in pandas


Is there an idiomatic equivalent to SQL's window functions in Pandas? For example, what's the most compact way to write the equivalent of this in Pandas?

SELECT state_name,  
       state_population,
       SUM(state_population)
        OVER() AS national_population
FROM population   
ORDER BY state_name 

Or this?:

SELECT state_name,  
       state_population,
       region,
       SUM(state_population)
        OVER(PARTITION BY region) AS regional_population
FROM population    
ORDER BY state_name

Solution

  • For the first SQL:

    SELECT state_name,  
           state_population,
           SUM(state_population)
            OVER() AS national_population
    FROM population   
    ORDER BY state_name 
    

    Pandas:

    df.assign(national_population=df.state_population.sum()).sort_values('state_name')
    

    For the second SQL:

    SELECT state_name,  
           state_population,
           region,
           SUM(state_population)
            OVER(PARTITION BY region) AS regional_population
    FROM population    
    ORDER BY state_name
    

    Pandas:

    df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
      .sort_values('state_name')
    

    DEMO:

    In [238]: df
    Out[238]:
       region state_name  state_population
    0       1        aaa               100
    1       1        bbb               110
    2       2        ccc               200
    3       2        ddd               100
    4       2        eee               100
    5       3        xxx                55
    

    national_population:

    In [246]: df.assign(national_population=df.state_population.sum()).sort_values('state_name')
    Out[246]:
       region state_name  state_population  national_population
    0       1        aaa               100                  665
    1       1        bbb               110                  665
    2       2        ccc               200                  665
    3       2        ddd               100                  665
    4       2        eee               100                  665
    5       3        xxx                55                  665
    

    regional_population:

    In [239]: df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
         ...:   .sort_values('state_name')
    Out[239]:
       region state_name  state_population  regional_population
    0       1        aaa               100                  210
    1       1        bbb               110                  210
    2       2        ccc               200                  400
    3       2        ddd               100                  400
    4       2        eee               100                  400
    5       3        xxx                55                   55