Search code examples
sqlsql-server

With Cube clause generating duplicate rows


A 'with cube' clause is generating duplicate rows, and I don't understand why. Here is an example:

CREATE TABLE venta_mes_hist (
    tienda_id INT,
    empresa_id INT,
    dia_id DATE,
    total_linea_brut DECIMAL(18, 2)
);

CREATE TABLE lk_tienda (
    tienda_id INT,
    empresa_id INT,
    tienda_desc VARCHAR(50)
);


-- Insert data into lk_tienda
INSERT INTO lk_tienda (tienda_id, empresa_id, tienda_desc)
VALUES 
(1, 1, 'SHIBORI'),
(2, 1, 'NARA'),
(3, 1, 'OSAKA');

-- Insert data into venta_mes_hist
INSERT INTO venta_mes_hist (tienda_id, empresa_id, dia_id, total_linea_brut)
VALUES
(1, 1, '2023-03-01', 100.00),
(1, 1, '2023-03-28', 2544.84),
(1, 1, '2023-04-15', 200.00),
(2, 1, '2023-03-01', 150.00),
(2, 1, '2023-03-28', 3000.00),
(2, 1, '2023-04-15', 250.00),
(3, 1, '2023-03-01', 200.00),
(3, 1, '2023-03-28', 3200.00),
(3, 1, '2023-04-15', 300.00),
(1, 1, '2023-05-10', 500.00);

Then I run this query:

DROP TABLE IF EXISTS #avenut;
SET DATEFORMAT mdy;

select cast(lk_tienda.tienda_desc as varchar(35)) Botiga,venta_mes_hist.dia_id Dia,cast(month(venta_mes_hist.dia_id) as int) Mes,
 sum(venta_mes_hist.total_linea_brut) Venut_valor_brut
 into #avenut
 from  lk_tienda, venta_mes_hist
 where 
 lk_tienda.tienda_id = venta_mes_hist.tienda_id and lk_tienda.empresa_id = venta_mes_hist.empresa_id 
 and venta_mes_hist.empresa_id in ('1') 
 and venta_mes_hist.dia_id >= '03/01/2023'
 and venta_mes_hist.dia_id <= '06/06/2024'
 group by cast(lk_tienda.tienda_desc as varchar(35)), venta_mes_hist.dia_id, cast(month(venta_mes_hist.dia_id) as int)
 with cube 
 order by cast(lk_tienda.tienda_desc as varchar(35)), venta_mes_hist.dia_id, cast(month(venta_mes_hist.dia_id) as int)

When I run:

SELECT * FROM #avenut
WHERE Botiga = 'SHIBORI' AND Dia = '2023-03-28';

It returns two rows:

Botiga  Dia Mes Venut_valor_brut
SHIBORI 2023-03-28  3   2544.84
SHIBORI 2023-03-28  3   2544.84

How is this possible? Of course, I expect one row, as if I understand correctly, the WITH CUBE clause generates totals by inserting rows with null values in some fields.

If you remove the month column, it returns one row. And if you add a new year column while maintaining the month column, it returns 4 rows!


Solution

  • Your issue is that you have added cast(month(venta_mes_hist.dia_id as int)) as one of the group by clauses. It firstly wasn't necessary anyway, as you have added venta_mes_hist.dia_id as a clause also (so you just can do the calculation only in the select), and it's also messing up the cube which will multiply out all the clauses. So it's the equivalent in grouping sets of:

     group by grouping sets (
       (cast(lk_tienda.tienda_desc as varchar(35)), venta_mes_hist.dia_id, cast(month(venta_mes_hist.dia_id) as int)),
       (cast(lk_tienda.tienda_desc as varchar(35)), venta_mes_hist.dia_id),
       (cast(lk_tienda.tienda_desc as varchar(35)), cast(month(venta_mes_hist.dia_id) as int)),
       (venta_mes_hist.dia_id, cast(month(venta_mes_hist.dia_id) as int)),
       (cast(lk_tienda.tienda_desc as varchar(35))),
       (venta_mes_hist.dia_id),
       (cast(month(venta_mes_hist.dia_id) as int)),
       ()
     )
    

    Just remove that clause from the group by. It can also be removed from the order by as it's functionally dependent on

    • Note that with cube is deprecated, use cube() instead.
    • cast(... as varchar(35)) seems unnecessary, you can remove that.
    select
      lk_tienda.tienda_desc Botiga,
      venta_mes_hist.dia_id Dia,
      cast(month(venta_mes_hist.dia_id) as int) Mes,
      sum(venta_mes_hist.total_linea_brut) Venut_valor_brut
     from  lk_tienda, venta_mes_hist
     where 
     lk_tienda.tienda_id = venta_mes_hist.tienda_id and lk_tienda.empresa_id = venta_mes_hist.empresa_id 
     and venta_mes_hist.empresa_id in ('1') 
     and venta_mes_hist.dia_id >= '03/01/2023'
     and venta_mes_hist.dia_id <= '06/06/2024'
     group by cube (
       lk_tienda.tienda_desc,
       venta_mes_hist.dia_id
     )
     order by
       lk_tienda.tienda_desc,
       venta_mes_hist.dia_id;
    

    You can also do this using grouping sets, which some find to be clearer.

     group by grouping sets (
       (lk_tienda.tienda_desc, venta_mes_hist.dia_id),
       (lk_tienda.tienda_desc),
       (venta_mes_hist.dia_id),
       ()
     )
    

    db<>fiddle