I have a table that holds scan datetime values. I am wanting to find the start and stop scan time of the users from the main portion of scanning. The issue is that a user may perform some checks before or after the bulk of the scanning and generate a few more scans. The data might look as below.
....
| 2020-04-01 19:48:05 |
| 2020-04-01 19:48:22 |
| 2020-04-01 19:48:23 |
| 2020-04-01 19:48:48 |
| 2020-04-01 19:48:49 |
| 2020-04-01 20:45:33 |
+---------------------+
If I group by the date and grab the min/max of these values my time elapsed will be much large than the actual. In the case above the max would add almost 1 hour of extra time, which was not really spent scanning.
SELECT date, MIN(datetime), MAX(datetime) FROM table GROUP BY date
There might be 1 extra scan or there might be several scans at the beginning or the end of the data so throwing out the first and last data points is not really an option.
Hmmm . . . I think this is a gap and islands problem. You need some definition of when an outlier occurs. Say it is 5 minutes:
select min(datetime), max(datetime), count(*) as num_scans
from (select t.*,
sum(case when prev_datetime > datetime - interval 5 minute then 0 else 1 end) over (order by datetime) as grp
from (select t.*,
lag(datetime) over (order by datetime) as prev_datetime
from t
) t
) t
group by grp;
I'm not sure how you distinguish actual scans from the outliers. Perhaps if there is more than one row or so. If that is the case, you can remove the outliers with logic such as having count(*) > 1
.