Search code examples
mysqldistributionsample

Select a distributed sample set of records from a MySQL set of many records


I have a table that has many rows in it, with rows occurring at the rate of 400-500 per minute (I know this isn't THAT many), but I need to do some sort of 'trend' analysis on the data that has been collected over the last 1 minute.

Instead of pulling all records that have been entered and then processing each of those, I would really like to be able to select, say, 10 records - which occur at a -somewhat- even distribution through the timeframe specified.

ID       DEVICE_ID       LA         LO          CREATED         
-------------------------------------------------------------------
1           1           23.4        948.7       2018-12-13 00:00:01
2           2           22.4        948.2       2018-12-13 00:01:01
3           2           28.4        948.3       2018-12-13 00:02:22
4           1           26.4        948.6       2018-12-13 00:02:33
5           1           21.4        948.1       2018-12-13 00:02:42
6           1           22.4        948.3       2018-12-13 00:03:02
7           1           28.4        948.0       2018-12-13 00:03:11
8           2           23.4        948.8       2018-12-13 00:03:12
...                                                             
492         2           21.4        948.4       2018-12-13 00:03:25
493         1           22.4        948.2       2018-12-13 00:04:01
494         1           24.4        948.7       2018-12-13 00:04:02
495         2           27.4        948.1       2018-12-13 00:05:04

Considering this data set, instead of pulling all those rows, I would like to maybe pull a row from the set every 50 records (10 rows for roughly ~500 rows returned).

This does not need to be exact, I just need a sample in which to perform some sort of linear regression on.

Is this even possible? I can do it in my application code if need be, but I wanted to see if there was a function or something in MySQL that would handle this.

Edit Here is the query I have tried, which works for now - but I would like the results more evenly distributed, not by RAND().

SELECT * FROM (
SELECT * FROM (
SELECT t.*, DATE_SUB(NOW(), INTERVAL 30 HOUR) as offsetdate

from tracking t
HAVING created > offsetdate) as parp
ORDER BY RAND()
LIMIT 10) as mastr
ORDER BY id ASC;

Solution

  • Do not order by RAND() as the rand calculated for every row, then reordered and only then you are selecting a few records.

    You can try something like this:

    SELECT
        *
    FROM
        (
            SELECT
                tracking.*
                , @rownum := @rownum + 1 AS rownum
            FROM
                tracking
                , (SELECT @rownum := 0) AS dummy
            WHERE
                created > DATE_SUB(NOW(), INTERVAL 30 HOUR)
        ) AS s
    WHERE
        (rownum % 10) = 0
    

    Index on created is "the must".

    Also, you might consider to use something like 'AND (UNIX_TIMESTAMP(created) % 60 = 0)' which is slightly different from what you wanted, however might be OK (depends on your insert distribution)