I need to run a query, that select two columns from a big table (3m+ rows, with selecting two columns, the result set is around 6-7m) and returns a list. So I used union to merge the columns into a list and also to eliminate duplicates. The problem is that I cant return the result in one query, I need to partition it, so I applied a LIMIT ?,?
to the subqueries, which the application layer sets via Prepared Statements.
SELECT val
FROM
(
(SELECT fs.smr as val
FROM `fr_search` as fs
ORDER BY val LIMIT ?,?)
UNION
(SELECT fs.dmr as val
FROM `fr_search` as fs
ORDER BY val LIMIT ?,?)
) as vals
GROUP BY val
The problem: The union eliminates the duplicates, but only after the LIMIT is applied. Meaning If the two query returns 100+100=200 rows and most of them is a duplicate, I only return <200 rows. How can I apply a limit to such a query, that I can return a specific amount of rows? (If I apply the LIMIT after the subqueries, It will take more than two minutes to run, so It will not solve the problem. )
You don't actually need a subquery for this. The following will work for the first 100 rows:
(SELECT DISTINCT fs.smr as val
FROM `fr_search` as fs
ORDER BY val
LIMIT 100
)
UNION
(SELECT DISTINCT fs.dmr as val
FROM `fr_search` as fs
ORDER BY val
LIMIT 100
)
ORDER BY val
LIMIT 100;
However, once you start putting in offset, it gets more complicated. For the next 100 rows:
(SELECT DISTINCT fs.smr as val
FROM `fr_search` as fs
ORDER BY val
LIMIT 200
)
UNION
(SELECT DISTINCT fs.dmr as val
FROM `fr_search` as fs
ORDER BY val
LIMIT 200
)
ORDER BY val
LIMIT 100, 100;
The problem is that you don't know where the second set will come from.
If you actually need to page through the result set, I would suggest that you store it in a temporary table and page off of the temporary table.