mysql sql union limit

Eliminate duplicates before union


I need to run a query that selects two columns from a big table (3M+ rows; with two columns selected, the result set is around 6-7M values) and returns them as a single list. I used UNION to merge the columns into one list and to eliminate duplicates. The problem is that I can't return the result in one query; I need to partition it, so I applied a LIMIT ?,? to the subqueries, which the application layer sets via prepared statements.

SELECT val
FROM 
(
    (SELECT fs.smr as val
    FROM `fr_search` as fs
    ORDER BY val LIMIT ?,?)

    UNION

    (SELECT fs.dmr as val
    FROM `fr_search` as fs
    ORDER BY val LIMIT ?,?)
) as vals
GROUP BY val

The problem: the UNION eliminates the duplicates, but only after the LIMIT is applied. So if the two subqueries return 100 + 100 = 200 rows and most of them are duplicates, I get back fewer than 200 rows. How can I apply a limit to such a query so that it returns a specific number of rows? (If I apply the LIMIT after the subqueries, the query takes more than two minutes to run, so that does not solve the problem.)


Solution

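  • You don't actually need a subquery for this. The first 100 distinct values of the combined list must come from the first 100 distinct values of each column, so the following will work for the first 100 rows: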

     (SELECT DISTINCT fs.smr as val
      FROM `fr_search` as fs
      ORDER BY val
      LIMIT 100
     )
     UNION
     (SELECT DISTINCT fs.dmr as val
      FROM `fr_search` as fs
      ORDER BY val
      LIMIT 100
     )
     ORDER BY val
     LIMIT 100;
    

    However, once you add an offset, it gets more complicated. For the next 100 rows:

     (SELECT DISTINCT fs.smr as val
      FROM `fr_search` as fs
      ORDER BY val
      LIMIT 200
     )
     UNION
     (SELECT DISTINCT fs.dmr as val
      FROM `fr_search` as fs
      ORDER BY val
      LIMIT 200
     )
     ORDER BY val
     LIMIT 100, 100;
    

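    The problem is that you don't know which column the second set will come from. Each branch therefore has to fetch its first 200 distinct values (offset plus page size), and only the outer LIMIT skips the first 100. The per-branch limit grows with every page.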
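
    The pattern above can be generalized to an arbitrary page. The following is a minimal sketch using MySQL's server-side PREPARE/EXECUTE; the variable names @page_offset, @page_size and @branch_limit and the statement name page_stmt are hypothetical and chosen only for illustration. An application-layer prepared statement would bind the same four values.

    ```sql
    -- Hypothetical paging parameters (not part of the original answer):
    -- @page_offset = rows to skip, @page_size = rows per page.
    SET @page_offset = 200;
    SET @page_size   = 100;
    -- Each branch must supply every row up to the end of the requested page.
    SET @branch_limit = @page_offset + @page_size;

    PREPARE page_stmt FROM '
        (SELECT DISTINCT fs.smr AS val
         FROM `fr_search` AS fs
         ORDER BY val
         LIMIT ?)
        UNION
        (SELECT DISTINCT fs.dmr AS val
         FROM `fr_search` AS fs
         ORDER BY val
         LIMIT ?)
        ORDER BY val
        LIMIT ?, ?';

    -- Placeholder order: branch limit, branch limit, outer offset, outer row count.
    EXECUTE page_stmt USING @branch_limit, @branch_limit, @page_offset, @page_size;
    DEALLOCATE PREPARE page_stmt;
    ```

    Because the branch limit is the offset plus the page size, later pages read more and more rows from each branch, so this is likely to be practical only for the first few pages.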

    If you actually need to page through the result set, I would suggest that you store it in a temporary table and page off of the temporary table.
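
    A minimal sketch of the temporary-table approach, assuming the table name tmp_vals and the index name idx_val (both hypothetical) and assuming val has a type that can be indexed directly:

    ```sql
    -- Build the de-duplicated list once. UNION (without ALL) removes
    -- duplicates across both columns.
    CREATE TEMPORARY TABLE tmp_vals AS
        SELECT fs.smr AS val FROM `fr_search` AS fs
        UNION
        SELECT fs.dmr AS val FROM `fr_search` AS fs;

    -- Index the values so that ordered reads do not have to sort the table.
    -- Assumption: val is not a TEXT/BLOB column (those need a prefix length).
    ALTER TABLE tmp_vals ADD INDEX idx_val (val);

    -- Page by offset: rows 101-200 of the combined list.
    SELECT val
    FROM tmp_vals
    ORDER BY val
    LIMIT 100, 100;

    -- Alternative, not from the original answer: page by the last value seen
    -- (keyset paging), which avoids skipping over earlier rows.
    -- @last_val is a hypothetical variable holding the final val of the
    -- previous page.
    SELECT val
    FROM tmp_vals
    WHERE val > @last_val
    ORDER BY val
    LIMIT 100;
    ```

    A temporary table is visible only to the connection that created it, so every page has to be requested over that same connection. If the application uses a connection pool, an ordinary table that is dropped afterwards may be the safer choice.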