Search code examples
mysqlcountright-join

MySQL Right Join Data Not Counting Correctly


I have this data:

+-------------+--------------------+-----------------------+
| employee_id | assignment_started | assignment_terminated |
+-------------+--------------------+-----------------------+
|           1 | 2018-07-01         | (NULL)                |
|           2 | 2018-09-01         | (NULL)                |
|           3 | 2018-10-13         | (NULL)                |
|           4 | 2018-10-13         | (NULL)                |
|           5 | 2018-10-15         | 2019-07-17            |
|           6 | 2018-11-01         | (NULL)                |
|           7 | 2019-01-14         | (NULL)                |
|           8 | 2019-01-24         | (NULL)                |
|           9 | 2019-07-01         | 2019-07-30            |
+-------------+--------------------+-----------------------+

I want to count employees by month that are under assignment. To determine if the employee is under assignment, I need to check if the date I am looking for is between assigment_started and assignment terminated. But if assignmen_termiated is null I set it on NOW().

Also, I have a date range I need to check. So if I have a date range from 2018-01-01 until 2019-07-30 I need to count employees by every month and if there are no employees under assignment on some months I should have 0 value as count.

To create DATE RANGE MONTHS I use this code:

select DISTINCT CONCAT(YEAR(gen_date),' ',MONTHNAME(gen_date)) AS month_name FROM 
(select adddate('1970-01-01',t4*10000 + t3*1000 + t2*100 + t1*10 + t0) gen_date FROM 
(select 0 t0 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t0, 
(select 0 t1 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t1, 
(select 0 t2 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t2, 
(select 0 t3 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t3, 
(select 0 t4 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t4) v 
WHERE gen_date between '2018-01-01 00:00:00' and '2019-08-31 23:59:59'

What I get from this is:

+-------------+
| month_name  | 
+-------------+
|2018 January |
|2018 February| 
|2018 March   | 
|2018 April   | 
|         ... | 
|         ... |
|         ... |
|2019 August  | 
+-------------+

From the data above you will see that until 2018 July I would have employee count 0, and for 2018 July I would have 1 employee, for example on 2018 September I should have count 5 because there are 5 employees that were working on that month.

To shorten the question, I use this code to achieve what I need, but for some reason count results are incorrect... I am trying to figure this out, but no idea why am I getting this results that you can find in fiddle below.

SELECT calendar.month_name, COUNT(employee_id) AS emp_count
FROM job_order_employees
RIGHT JOIN (select DISTINCT CONCAT(YEAR(gen_date),' ',MONTHNAME(gen_date)) AS month_name FROM 
(select adddate('1970-01-01',t4*10000 + t3*1000 + t2*100 + t1*10 + t0) gen_date FROM 
(select 0 t0 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t0, 
(select 0 t1 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t1, 
(select 0 t2 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t2, 
(select 0 t3 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t3, 
(select 0 t4 union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t4) v 
WHERE gen_date between '2018-01-01 00:00:00' and '2019-08-31 23:59:59') as calendar
ON STR_TO_DATE(CONCAT(calendar.month_name,'01'),'%Y %M %d') BETWEEN job_order_employees.assignment_started AND IFNULL(job_order_employees.assignment_terminated,NOW())
GROUP BY calendar.month_name
ORDER BY STR_TO_DATE(calendar.month_name,'%Y %M') 

Here is some sample data:

-- Dumping structure for table d-works-test.job_order_employees
CREATE TABLE IF NOT EXISTS `job_order_employees` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `employee_id` int(10) unsigned NOT NULL,
  `assignment_started` date NOT NULL,
  `assignment_terminated` date DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

-- Dumping data for table d-works-test.job_order_employees: ~14 rows (approximately)
/*!40000 ALTER TABLE `job_order_employees` DISABLE KEYS */;
INSERT INTO `job_order_employees` 
(`id`
, `employee_id`
,`assignment_started`
, `assignment_terminated`) VALUES
(1, 1,'2019-05-29', NULL),
(2, 2,'2018-09-19', NULL),
(3, 3,'2018-07-01', NULL),
(4, 4, '2018-10-13', NULL),
(5, 5, '2018-10-13', NULL),
(6, 6, '2019-02-01', NULL),
(7, 7, '2019-01-14', NULL),
(8, 8, '2018-11-01', NULL),
(9, 8, '2019-01-01', NULL),
(10, 9, '2019-02-01', NULL),
(11, 9, '2019-01-24', NULL),
(12, 9, '2018-12-31', NULL),
(13, 10, '2018-10-13', '2019-07-17'),
(14, 10, '2019-07-01', '2019-07-17');

And DB Fiddle of same: https://www.db-fiddle.com/f/8dUFx1DWiyypbkx9s2cYyG/1

Thank you in advance for your help!


Solution

  • You could probably simplify your logic quite a bit by holding off formatting the month strings until the final step; you're doing a lot of work back converting for something only relevant to the final formatting.

    This will also help because then you can define the inclusive start and exclusive end of each month, like so:

    SELECT adddate('1970-01-01',t4*10000 + t3*1000 + t2*100 + t1*10 + t0) gen_date FROM (stuff) v

    Then, use that like so:

    SELECT [format rangestart here], COUNT(employee_id) AS emp_count
    FROM (
       SELECT DISTINCT gen_date AS rangestart, gen_date + INTERVAL 1 MONTH AS rangeend 
       FROM v
       WHERE gen_date BETWEEN '2018-01-01 00:00:00' AND'2019-08-31 23:59:59'
    ) as calendar
    LEFT JOIN job_order_employees AS joe
       ON IFNULL(joe.assignment_terminated,NOW()) >= calendar.rangestart
       AND joe.assignment_started <= calendar.rangeend
    GROUP BY calendar.rangestart
    ORDER BY calendar.rangestart 
    ;
    

    The join logic (overlap checking condition) looks a bit weird until you realize where it comes from. It's the simplification of "not ones that don't overlap".

    NOT (ended < range_start || started > range_end) simplifies to ended >= range_start && started <= range_end


    Edit: The above wrongly assumed the subquery was generating every month; the following should work

    calendar query (this will cover about 83 years, you can add another t# table with a multiplier of 1000 to get 833 years worth):

    SELECT '1970-01-01' + INTERVAL t0 + t1 * 10 + t2 * 100 MONTH AS start_date
        , '1970-01-01' + INTERVAL 1 + t0 + t1 * 10 + t2 * 100 MONTH AS end_date  
    FROM (SELECT 0 t0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) t0
        , (SELECT 0 t1 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) t1
        , (SELECT 0 t2 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) t2
    

    Final query:

    SELECT [format calendar.start_date here]
       , COUNT(employee_id) AS emp_count
    FROM ( 
       *calendar query above goes here* 
    ) as calendar
    LEFT JOIN job_order_employees AS joe
       ON IFNULL(joe.assignment_terminated,NOW()) >= calendar.start_date
       AND joe.assignment_started < calendar.end_date
    WHERE calendar.start_date BETWEEN '2018-01-01 00:00:00' AND '2019-08-31 23:59:59'
    GROUP BY calendar.start_date
    ORDER BY calendar.start_date
    ;
    

    Note: I also changed the operators on the overlap comparison; since the generated end_date is non-inclusive, it should have been NOT (ended < range_start || started >= range_end) which simplifies to ended >= range_start && started < range_end