Search code examples
mysqlsqlgroup-byquery-performancefilesort

MySql GROUP BY using filesort - query optimization


I have a table like this:

CREATE TABLE `purchase` (
  `fact_purchase_id` binary(16) NOT NULL,
  `purchase_id` int(10) unsigned NOT NULL,
  `purchase_id_primary` int(10) unsigned DEFAULT NULL,
  `person_id` int(10) unsigned NOT NULL,
  `person_id_owner` int(10) unsigned NOT NULL,
  `service_id` int(10) unsigned NOT NULL,
  `fact_count` int(10) unsigned NOT NULL DEFAULT '0',
  `fact_type` tinyint(3) unsigned NOT NULL,
  `date_fact` date NOT NULL,
  `purchase_name` varchar(255) DEFAULT NULL,
  `activation_price` decimal(7,2) unsigned NOT NULL DEFAULT '0.00',
  `activation_price_total` decimal(7,2) unsigned NOT NULL DEFAULT '0.00',
  `renew_price` decimal(7,2) unsigned DEFAULT '0.00',
  `renew_price_total` decimal(7,2) unsigned NOT NULL DEFAULT '0.00',
  `activation_cost` decimal(7,2) unsigned DEFAULT '0.00',
  `activation_cost_total` decimal(7,2) unsigned NOT NULL DEFAULT '0.00',
  `renew_cost` decimal(7,2) unsigned DEFAULT '0.00',
  `renew_cost_total` decimal(7,2) unsigned NOT NULL DEFAULT '0.00',
  `date_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`fact_purchase_id`),
  KEY `purchase_id_idx` (`purchase_id`),
  KEY `person_id_idx` (`person_id`),
  KEY `person_id_owner_idx` (`person_id_owner`),
  KEY `service_id_idx` (`service_id`),
  KEY `fact_type_idx` (`fact_type`),
  KEY `renew_price_idx` (`renew_price`),
  KEY `renew_cost_idx` (`renew_cost`),
  KEY `renew_price_year_idx` (`renew_price_year`),
  KEY `renew_cost_year_idx` (`renew_cost_year`),
  KEY `date_created_idx` (`date_created`),
  KEY `purchase_id_primary_idx` (`purchase_id_primary`),
  KEY `fact_count` (`fact_count`),
  KEY `renew_price_year_total_idx` (`renew_price_total`),
  KEY `renew_cost_year_total_idx` (`renew_cost_total`),
  KEY `date_fact` (`date_fact`) USING BTREE,
  CONSTRAINT `purchase_person_fk` FOREIGN KEY (`person_id`) REFERENCES `person` (`person_id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
  CONSTRAINT `purchase_person_owner_fk` FOREIGN KEY (`person_id_owner`) REFERENCES `person` (`person_id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
  CONSTRAINT `purchase_service_fk` FOREIGN KEY (`service_id`) REFERENCES `service` (`service_id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

I'm launching this query:

SELECT 
    purchase.date_fact,
    UNIX_TIMESTAMP(purchase.date_fact),
    COUNT(DISTINCT purchase.purchase_id) AS Num
FROM
    purchase
WHERE
    purchase.date_fact >= '2017-01-01'
    AND purchase.date_fact <= '2017-01-31'
    AND purchase.fact_type = 3
    AND purchase.purchase_id_primary IS NULL
GROUP BY purchase.date_fact

The table contains a total of 5.629.670 records and running an EXPLAIN on the query I get these results:

  • rows = 2.814.835
  • possible_keys = fact_type_idx,purchase_id_primary_idx,date_fact
  • key = fact_type_idx
  • key_len = 1
  • ref = const
  • filtered = 25.00
  • Extra = Using index condition;Using where;Using filesort

The query takes 30-35 seconds to be executed. This is too long to wait.

The problem is that the GROUP BY causes filesort to be applied. Applying ORDER BY NULL to the query doesn't change anything.

I could possibly use a covering index, but I just need date_fact in this query: which fields can I use?

How can I avoid filesort on GROUP BY? How can I optimize the query in order to make it faster?

I'm using this table for statistics purposes (OLAP). Maybe is there any better DBMS for this purpose?

I'm running MySql Server 5.7.17.

Thank you


Solution

  • For this query:

    SELECT p.date_fact, UNIX_TIMESTAMP(p.date_fact),
           COUNT(DISTINCT p.purchase_id) AS Num
    FROM purchase p
    WHERE p.date_fact >= '2017-01-01' AND
          p.date_fact <= '2017-01-31' AND
          p.fact_type = 3 AND
          p.purchase_id_primary IS NULL
    GROUP BY p.date_fact;
    

    I would recommend a compound index on (fact_type, purchase_id_primary, date_fact, purchase_id). The first two keys have equality conditions in the WHERE. The third has an inequality, and the fourth allows the index to "cover" the query (all columns in the query are in the index).

    I would also add: if you don't need COUNT(DISTINCT), then don't use it. purchase_id might already be unique in purchase.