Search code examples
sqlsql-servert-sql

Query to count unique companies for each year within projects that involve collaboration from two or more companies


The dataset looks like this:

Project Company start year end year
Project 1 Company A 2015 2020
Project 1 Company A 2015 2020
Project 1 Company B 2015 2020
Project 1 Company C 2015 2020
Project 2 Company D 2016 2021
Project 2 Company D 2016 2021
Project 2 Company A 2016 2019
Project 3 Company E 2019 2023
Project 3 Company E 2019 2023
Project 3 Company E 2019 2023

I am trying to get a number of unique companies (as they can be part of several projects) for each year, when they are part of a project with collaboration of at least two companies.

And those companies should be counted for each year they are in those projects.

So the result would be:

year unique companies
2015 248
2016 279
2017 423

etc.

I have this data in Excel, haven't found any way to count it (work with Excel 2016).

I tried SQL Server Management Studio. My knowledge of SQL is not so wide, so I asked ChatGPT to help me with the query, we had quite a conversation and tried over 20 queries (as I explained again and shared the errors or unwanted results).

One of the queries we tried is this one. But the result was I believe only counting the company in the year the project started, as some years have very low numbers in the result.

WITH ProjectCompanyCTE AS 
(
    SELECT
        [ID projektu] AS Project_ID,
        [Ičo] AS Company_ID,
        [Rok zahájení] AS Year
    FROM
        dbo.IsvavaiProjects
    WHERE
        [ID projektu] IN (SELECT [ID projektu]
                          FROM dbo.IsvavaiProjects
                          GROUP BY [ID projektu]
                          HAVING COUNT(DISTINCT [Ičo]) >= 2)
)
SELECT
    Year,
    COUNT(DISTINCT Company_ID) AS UniqueCompaniesCount
FROM
    ProjectCompanyCTE
GROUP BY
    Year
ORDER BY
    Year;

Solution

  • You first need to start with a table of years (you can use GENERATE_SERIES or some other method) and join your table to it. That splits out each row into its constituent years.

    Then use a window function to count the number of companies per project. And simply group up by year and count distinct CompanyIDs, filtering the window function to 2 or more.

    SELECT
      ip.Year,
      UniqueCompanies = COUNT(DISTINCT ip.CompanyID)
    FROM (
        SELECT 
          Year = year.value,
          ip.Project_ID,
          ip.CompanyID,
          CompaniesPerProject = COUNT(*) OVER (PARTITION BY year.value, ip.Project_ID)
        FROM GENERATE_SERIES(2015, 2023) year
        JOIN dbo.IsvavaiProjects ip
          ON year.value BETWEEN ip.StartYear AND ip.EndYear
    ) ip
    WHERE ip.CompaniesPerProject >= 2
    GROUP BY
      ip.Year;