Search code examples
mysqloptimizationderived-tablein-clause

Optimizing a MySQL query with a large IN() clause or join on derived table


Let's say I need to query the associates of a corporation. I have a table, "transactions", which contains data on every transaction made.

CREATE TABLE `transactions` (
  `transactionID` int(11) unsigned NOT NULL,
  `orderID` int(11) unsigned NOT NULL,
  `customerID` int(11) unsigned NOT NULL,
  `employeeID` int(11) unsigned NOT NULL, 
  `corporationID` int(11) unsigned NOT NULL,
  PRIMARY KEY (`transactionID`),
  KEY `orderID` (`orderID`),
  KEY `customerID` (`customerID`),
  KEY `employeeID` (`employeeID`),
  KEY `corporationID` (`corporationID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

It's fairly straightforward to query this table for associates, but there's a twist: A transaction record is registered once per employee, and so there may be multiple records for one corporation per order.

For example, if employees A and B from corporation 1 were both involved in selling a vacuum cleaner to corporation 2, there would be two records in the "transactions" table; one for each employee, and both for corporation 1. This must not affect the results, though. A trade from corporation 1, regardless of how many of its employees were involved, must be treated as one.

Easy, I thought. I'll just make a join on a derived table, like so:

SELECT corporationID FROM transactions JOIN (SELECT DISTINCT orderID FROM transactions WHERE corporationID = 1) AS foo USING (orderID)

The query returns a list of corporations who have been involved in trades with corporation 1. That's exactly what I need, but it's very slow because MySQL can't use the corporationID index to determine the derived table. I understand that this is the case for all subqueries/derived tables in MySQL.

I've also tried to query a collection of orderIDs separately and use a ridiculously large IN() clause (typhically 100 000+ IDs), but as it turns out MySQL has issues using indices on ridiculously large IN() clauses as well and as a result the query time does not improve.

Are there any other options available, or have I exhausted them both?


Solution

  • If I understand your requirement, you could try this.

    select distinct t1.corporationID
    from transactions t1
    where exists (
        select 1
        from transactions t2
        where t2.corporationID =  1
        and t2.orderID = t1.orderID)
    and t1.corporationID != 1;
    

    or this:

    select distinct t1.corporationID
    from transactions t1
    join transactions t2
    on t2.orderID = t1.orderID
    and t1.transactionID != t2.transactionID
    where t2.corporationID = 1
    and t1.corporationID != 1;