Search code examples
sqlpostgresqldata-analysis

How to count number of people for each month, who was on internship, if we know dates of internship


First of all, sorry for bad title - I can't figure out how to write generalized formulation of my problem. I have a table in PostgreSQL with users and dates of their internships. It looks like this:

user_id start end
1 December 22, 2019 June 29, 2020
2 March 8, 2020 September 8, 2020
3 May 21, 2020 November 21, 2020

From this I need to calculate for each month, how many people were on internship during this month. I only need to calculate full months (if internship actually started on December 22, 2019, I will calculate from January 2022. If internship were finished at June 29, 2020, I will calculate till May 2020. Finally I need this table:

Month Count
Jan-20 1
Feb-20 1
Mar-20 1
Apr-20 2
May-20 2
Jun-20 2
Jul-20 2
Aug-20 2
Sep-20 1
Oct-20 1

For making it absolutely clear, this is how I got it:

Month user_1 user_2 user_3 Count
Jan-20 1 1
Feb-20 1 1
Mar-20 1 1
Apr-20 1 1 2
May-20 1 1 2
Jun-20 1 1 2
Jul-20 1 1 2
Aug-20 1 1 2
Sep-20 1 1
Oct-20 1 1

My idea is to:

  1. Reshape my initial table, so it will look like this:
user_id date event
1 December 22, 2019 start
1 June 29, 2020 end
2 March 8, 2020 start
2 September 8, 2020 end
3 May 21, 2020 start
3 November 21, 2020 end
  1. Generate series between each start and end event:
user_id month
1 Jan-20
1 Feb-20
1 Mar-20
1 Apr-20
1 May-20
2 Apr-20
2 May-20
2 Jun-20
2 Jul-20
2 Aug-20
3 Jun-20
3 Jul-20
3 Aug-20
3 Sep-20
3 Oct-20
  1. Using count() GROUP BY month

Unfortunately, I have problems with 1 and 2 clauses.

I don't know how to reshape the table in PostgreSQL. In Pandas I would use 'stack' function. For my case I can't find the appropriate function.

Even if I can reshape it, I don't understand how to make series of month for each user (shown above).

Please advise, what can be done here to solve my problem?


Solution

  • this query should do the job considering your table as test :

    SELECT to_char(d.date, 'Mon-YY') AS month, count(*) AS count
      FROM
         ( SELECT generate_series(date_trunc('month', min(start_date)), date_trunc('month', max(end_date)), interval '1 month') :: date AS date
             FROM test
         ) AS d
     INNER JOIN test AS t
        ON daterange(t.start_date, t.end_date, '[]') && daterange(d.date, (d.date + interval '1 month') :: date)
     WHERE daterange(t.start_date, t.end_date, '[]') @> daterange(d.date, (d.date + interval '1 month') :: date)
     GROUP BY d.date
    
    • The first sub query calculate the months covered in table test.
    • The JOIN clause calculates the intersections between the months and the date interval for each user
    • The WHERE clause filters the rows where the date interval for a user corresponds to a full months.

    Result :

    month count
    Jan-20 1
    Feb-20 1
    Mar-20 1
    Apr-20 2
    May-20 2
    Jun-20 2
    Jul-20 2
    Aug-20 2
    Sep-20 1
    Oct-20 1

    see results in dbfiddle