Search code examples
mysqloptimizationscalingmailing-list

optimising and scaling mysql structure + queries for large mailing groups


So I have a system that stores contacts and allows them to be put into groups. These groups can be defined by criteria (everyone with surname 'smith'), or by explicitly adding / excluding people.

The problem I am having is that when I list the mailing groups, I need to count how many contacts are in each one. This number can change as contacts are added / removed from the contacts table. On small groups / amounts of contacts it is fine, however using 50k ish contacts runs into problems

An example query I use for this is as follows:

SELECT COUNT(c_id) FROM contacts, mgroups
LEFT JOIN mgroups_explicit ON mg_id = me_mg_id
WHERE mgroups.site_id = '10'
AND mg_id = '20'
AND me_c_id = c_id
AND contacts.site_id = '10'
OR (contacts.site_id = '10' AND ( c_tags LIKE '%tag1%')) AND c_id NOT IN
( SELECT mex_c_id FROM mgroups_exclude WHERE c_id = mex_c_id ) GROUP BY c_id 

The criteria table does not feature in this query, as the problem presents itself when large groups are created explicitly, rather than with a criteria. This is required as criteria based groups grow or shrink on the fly as you modify your contacts, where as explicit is generally set in stone. So in this case, if you explicitly add 20k contacts to a group, it adds 20k rows to the table marked with that mg_id as a foreign key.

This basically takes ages / times out / gets the wrong number / generally doesn't work very well. I either need to figure out a more efficient query, or figure out a better way to store everything.

Any ideas?

The 5 main tables that make up the database

contacts - where the actual contacts reside
Field   Type    Null    Default     Comments
c_id    int(8)  No           
site_id     int(6)  No           
c_email     varchar(500)    No           
c_source    varchar(255)    No           
c_subscribed    tinyint(1)  No      0    
c_special   tinyint(1)  No      0    
c_domain    text    No           
c_title     varchar(12)     No           
c_name  varchar(128)    No           
c_surname   varchar(128)    No           
c_company   varchar(128)    No           
c_jtitle    text    No           
c_ad1   text    No           
c_ad2   text    No           
c_ad3   text    No           
c_county    varchar(64)     No           
c_city  varchar(128)    No           
c_postcode  varchar(32)     No           
c_lat   varchar(100)    No           
c_lng   varchar(100)    No           
c_country   varchar(64)     No           
c_tel   varchar(20)     No           
c_mob   varchar(20)     No           
c_dob   date    No           
c_registered    datetime    No           
c_updated   datetime    No           
c_twitter   varchar(255)    No           
c_facebook  varchar(255)    No           
c_tags  text    No           
c_special_1     text    No           
c_special_2     text    No           
c_special_3     text    No           
c_special_4     text    No           
c_special_5     text    No           
c_special_6     text    No           
c_special_7     text    No           
c_special_8     text    No           

mgroups - basic mailing group info
Field   Type    Null    Default     Comments
mg_id   int(8)  No           
site_id     int(6)  No           
mg_name     varchar(255)    No           
mg_created  datetime    No           

mgroups_criteria - criteria for said mailing groups
Field   Type    Null    Default     Comments
mc_id   int(8)  No           
site_id     int(6)  No           
mc_mg_id    int(8)  No           
mc_criteria     text    No           

mgroups_exclude - anyone to exclude from criteria
Field   Type    Null    Default     Comments
mex_id  int(8)  No           
site_id     int(6)  No           
mex_c_id    int(8)  No           
mex_mg_id   int(8)  No           

mgroups_explicit - anyone to explicitly add without the use of criteria
Field   Type    Null    Default     Comments
me_id   int(8)  No           
site_id     int(6)  No           
me_c_id     int(8)  No           
me_mg_id    int(8)  No

And the indexs / explain of query. Must admit, indexes are not my strong point, any improvements?

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra
1   PRIMARY     mgroups     ALL     PRIMARY,mg_id   NULL    NULL    NULL    9   Using temporary; Using filesort
1   PRIMARY     mgroups_explicit    ref     me_mg_id    me_mg_id    4   engine_4.mgroups.mg_id  8750     
1   PRIMARY     contacts    ALL     PRIMARY,c_id    NULL    NULL    NULL    86012   Using where; Using join buffer
2   DEPENDENT SUBQUERY  NULL    NULL    NULL    NULL    NULL    NULL    NULL    Impossible WHERE noticed after reading const table...

Solution

  • Right so I got this answered elsewhere (Huge thanks to Hambut_Bulge), so for the sake of it being useful to anyone else heres the solution:


    First things off you're mixing old and new (ANSI) style joins in the same query. This is considered a bad idea in SQL circles. By old style I mean we write a query with a join along these lines

    SELECT a.column_name, b.column2
    FROM table1 a, second_table b
    WHERE a.id_key = b.fid_key
    AND b.some_other_criteria = 'Y';
    

    In the newer ANSI style we'd rewrite the above to this:

    SELECT a.column_name, b.column2
    FROM table1 a INNER JOIN second_table b ON a.id_key = b.fid_key
    WHERE b.some_other_criteria = 'Y';
    

    Its neater and easier to read which bits are join conditions and which are where clauses. Its also best to get into the habit of using ANSI style as old style support may (at some point) be discontinued.

    Also try and be consistent in your use of dot notation and/or aliases. Again it makes big queries easier to read.

    Back to your problem query, I began by starting to convert it into ANSI style and straight-away noticed that you don't have a join condition between contacts and mgroups. This means that optimizer will create a cross join (also called a cartesian product), which was probably something you don't want to do. The cross join (in case you didn't know) joins every row in the contacts table with every row in the mgroups table. So if you have 50,000 rows in contacts and 20,000 rows in mgroup you're going to get a joined result set containing 1,000,000,000 rows!

    The other thing that is going to slow this query drastically is the subquery on mgroups_exclude. A subquery is executed once for each row in the outer query eg:

    SELECT a.column1
    FROM table1 a
    WHERE a.id_key NOT IN ( SELECT * FROM table2 b WHERE a.id_key = b.fid_key);
    

    Assume that table1 has 2,000,000 rows and table2 has 500,000. For each and every row in the outer query (table1) the database is going to have to do a full scan on the inner query. So to get a result the database will have read 1,000,000,000,000 rows and we may only be interested in 1,000! It will not touch any indexes no matter what.

    To get around this we can use a left join (also called a left outer join) on the two tables.

    SELECT a.column1
    FROM table1 a LEFT JOIN table2 b ON a.id_key = b.fid_key
    WHERE b.fid_key IS NULL;
    

    An outer join does not require each record in the joined tables to have a matching record. So the example above we'd get all the records from table1 even if there is no match on table2. For non-matched records the database returns a NULL and we can test for that in the where clause. Now the optimizer can scan the indexes on the two tables id_key fields (assuming there are any), resulting in a much faster query.

    So, to wrap up. I'd rewrite your orginal query thus:

    SELECT COUNT( a.c_id )
    FROM contacts a
    INNER JOIN mgroups b ON a.c_id = b.mg_id
    LEFT JOIN mgroups_explicit c ON b.mg_id = c.me_mg_id
    LEFT JOIN mgroups_exclude d ON a.c_id = d.mex_c_id
    WHERE b.mg_id = '20'
    AND a.site_id = '10'
    AND a.c_tags LIKE '%tag1%'
    AND d.mex_c_id IS NULL
    GROUP BY c_id;