Search code examples
sqldatabasepostgresqlregionpopulation

Population by United States Region


Using SQL Let’s say we have a dataset of population by state (e.g., Vermont, 623,251, and so on), but we want to know the population by United States region (e.g., Midwest, 68,985,454). Could you describe how you would go about doing that?

Dataset from census.gov

Where I'm stuck at

--First I created a table with a state and population column.

CREATE TABLE states (
    state VARCHAR(20),
    population INT
);

--Then I uploaded a CSV file from census.gov that I cleaned up.

SELECT * FROM states;

--Created a temporary table to add in the region column.

DROP TABLE IF EXISTS temp_Regions;

CREATE TEMP TABLE temp_Regions (
    state VARCHAR(20),
    state_pop INT,
    region VARCHAR(20),
    region_pop INT
);

INSERT INTO temp_Regions
SELECT state, population
FROM states;

--Used CASE WHEN statements to put states in to their respective regions.

SELECT state,
    CASE WHEN state IN ('Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont', 'New Jersey', 'New York', 'Pennsylvania') THEN 'Northeast'
         WHEN state IN ('Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin', 'Iowa', 'Kansas', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'South Dakota') THEN 'Midwest'
         WHEN state IN ('Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Maryland', 'North Carolina', 'South Carolina', 'Virginia', 'West Virginia', 'Alabama', 'Kentucky', 'Mississippi', 'Tennessee', 'Arkansas', 'Louisiana', 'Oklahoma', 'Texas') THEN 'South'
         WHEN state IN ('Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico', 'Utah', 'Wyoming', 'Alaska', 'California', 'Hawaii', 'Oregon', 'Washington') THEN 'West'
    END AS region, state_pop, region_pop
    FROM temp_Regions;

--Now I'm stuck at this point. I'm unable to get data into the region_pop column. How do I get the sum of the populations by U.S. Region?

Let me know if you need further clarification on things. Thanks for your help y'all!


Solution

  • You can make use of analytical function sum() over(partition by) to achieve this

    with data
     as (
        SELECT state
              ,CASE WHEN state IN ('Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont', 'New Jersey', 'New York', 'Pennsylvania') THEN 'Northeast'
                 WHEN state IN ('Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin', 'Iowa', 'Kansas', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'South Dakota') THEN 'Midwest'
                 WHEN state IN ('Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Maryland', 'North Carolina', 'South Carolina', 'Virginia', 'West Virginia', 'Alabama', 'Kentucky', 'Mississippi', 'Tennessee', 'Arkansas', 'Louisiana', 'Oklahoma', 'Texas') THEN 'South'
                 WHEN state IN ('Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico', 'Utah', 'Wyoming', 'Alaska', 'California', 'Hawaii', 'Oregon', 'Washington') THEN 'West'
                END AS region
                , state_pop
            FROM temp_Regions
          )
    select state
           ,region
           ,state_pop
           ,sum(state_pop) over(partition by region) as region_population
      from data