Search code examples
mysqlsqlindexingsubqueryanalytics

Best way to index and query analytic table in MySQL


I have an analytics table (5M rows and growing) with the following structure

Hits 
  id int() NOT NULL AUTO_INCREMENT,
  hit_date datetime NOT NULL,
  hit_day int(11) DEFAULT NULL,
  gender varchar(255) DEFAULT NULL,
  age_range_id int(11) DEFAULT NULL,
  klout_range_id int(11) DEFAULT NULL,
  frequency int(11) DEFAULT NULL,
  count int(11) DEFAULT NULL,
  location_id int(11) DEFAULT NULL,
  source_id int(11) DEFAULT NULL,
  target_id int(11) DEFAULT NULL,

Most queries to the table is to query between two datetimes for a particular sub-set of columns and them sum up all the count column across all rows. For example:

SELECT target.id,
   SUM(CASE gender WHEN 'm' THEN count END) AS 'gender_male',
   SUM(CASE gender WHEN 'f' THEN count END) AS 'gender_female',
   SUM(CASE age_range_id WHEN 1 THEN count END) AS 'age_18 - 20',
   SUM(CASE target_id WHEN 1 then count END) AS 'target_test'
   SUM(CASE location_id WHEN 1 then count END) AS 'location_NY'
FROM Hits
WHERE (location_id =1 or location_id = 2)
  AND (target_id = 40 OR target_id = 22)
  AND cast(hit_date AS date) BETWEEN '2012-5-4'AND '2012-5-10'
GROUP BY target.id

The interesting thing about queries to this table is that the where clause include any permutation of Hit columns names and values since those are what we're filtering against. So the particular query above is getting the # of males and females between the ages of 18 and 20 (age_range_id 1) in NY that belongs to a target called "test". However, there are over 8 age groups, 10 klout ranges, 45 locations, 10 sources etc (all foreign key references).

I currently have an index on hot_date and another one on target_id. What the best way to properly index this table?. Having a composite index on all column fields seems inherently wrong.

Is there any other way to run this query without using a sub-query to sum up all counts? I did some research and this seems to be the best way to get the data-set I need but is there a more efficient way of handling this query?


Solution

  • Here's your optimized query. The idea is to get rid of the ORs and the CAST() function on hit_date so that MySQL can utilize a compound index that covers each of the subsets of data. You'll want a compound index on (location_id, target_id, hit_date) in that order.

    SELECT id, gender_male, gender_female, `age_18 - 20`, target_test, location_NY
    FROM
    (
    SELECT target.id,
       SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
       SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
       SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
       SUM(CASE target_id WHEN 1 then 1 END) AS target_test,
       SUM(CASE location_id WHEN 1 then 1 END) AS location_NY
    FROM Hits
    WHERE (location_id =1)
      AND (target_id = 40)
      AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
    GROUP BY target.id
    
    UNION ALL
    
    SELECT target.id,
       SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
       SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
       SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
       SUM(CASE target_id WHEN 1 then 1 END) AS target_test,
       SUM(CASE location_id WHEN 1 then 1 END) AS location_NY
    FROM Hits
    WHERE (location_id = 2)
      AND (target_id = 22)
      AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
    GROUP BY target.id
    
    UNION ALL
    
    SELECT target.id,
       SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
       SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
       SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
       SUM(CASE target_id WHEN 1 then 1 END) AS target_test,
       SUM(CASE location_id WHEN 1 then 1 END) AS location_NY
    FROM Hits
    WHERE (location_id =1)
      AND (target_id = 22)
      AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
    GROUP BY target.id
    
    UNION ALL
    
    SELECT target.id,
       SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
       SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
       SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
       SUM(CASE target_id WHEN 1 then 1 END) AS target_test,
       SUM(CASE location_id WHEN 1 then 1 END) AS location_NY
    FROM Hits
    WHERE (location_id = 2)
      AND (target_id = 22)
      AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
    GROUP BY target.id
    ) a
    GROUP BY id
    

    If your selection size is so large that this is no improvement, then you may as well keep scanning all rows like you're already doing.

    Note, surround aliases with back ticks, not single quotes, which are deprecated. I also fixed your CASE clauses which had count instead of 1.