I have a database table of roughly 3 million rows (four or five columns of integers and strings). I'm wondering which is more efficient:
A) use DISTINCT on an integer column and then .map(:some_column)
to grab all unique values of :some_column
or
B) same as above, but add .limit(100)
and page through the results 100 rows at a time, doing what needs to be done with each batch of 100.
Does it make a difference?
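For concreteness, here's roughly what I mean in Sequel (the connection URL, table name, and do_something are placeholders):

    require 'sequel'

    DB = Sequel.connect('postgres://localhost/mydb') # placeholder URL

    # Option A: one query, all unique values materialized at once.
    values = DB[:my_table].select(:some_column).distinct.map(:some_column)

    # Option B: page through the distinct values via limit/offset.
    offset = 0
    loop do
      batch = DB[:my_table].select(:some_column).distinct
                .order(:some_column).limit(100, offset).map(:some_column)
      break if batch.empty?
      batch.each { |value| do_something(value) } # do_something is a placeholder
      offset += 100
    end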
It depends on how many unique values there are in the 3 million rows. If there are only 10,000 unique values, then map(:some_column) is fine. If there are 3 million unique values, it's going to be painful. You definitely want to avoid a limit/offset approach: on each query the database has to compute and throw away every row before the offset, so the total work grows quadratically with the number of pages. The fastest approach for a large number of unique values would probably be to use the streaming support in sequel_pg, or, for even more speed, drop down a level and use the ruby-pg driver directly (which also supports streaming).
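A minimal sketch of the sequel_pg streaming approach, using Sequel's pg_streaming extension (it requires the sequel_pg gem; the connection URL, table name, and do_something are placeholders):

    require 'sequel'

    DB = Sequel.connect('postgres://localhost/mydb') # placeholder URL
    DB.extension(:pg_streaming)

    # Rows are yielded as they arrive from the server instead of being
    # buffered into one big array, so memory use stays flat even with
    # millions of distinct values.
    DB[:my_table].select(:some_column).distinct.stream.each do |row|
      do_something(row[:some_column]) # do_something is a placeholder
    end

At the ruby-pg level, the equivalent is to issue the query with send_query, call set_single_row_mode on the connection, and then consume rows one at a time with stream_each on the result.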