Search code examples
sqllanguage-agnosticrelational-databaseterminologygaps-and-islands

Is there a formal definition of Gaps and Islands problems? If so, does this problem satisfy it?


It seems that the term "gaps and islands" is overused in my workplace. I recently had essentially the following problem given to me under that banner.

Take a set of data with many rows, each containing lots of data, but in particular, always including a start and stop time column and including many other columns where if one is not NULL then the others are. For example:

Start Time Stop Time Drunkenness Programming Ability
01 60 0100 NULL
10 20 NULL 0450
40 50 NULL 0250

(you may also use the obvious unpivoted equivalent, but don't worry about that)

and convert that data in to a form where all of the data is collapsed in such a way that you can find out what's true at any given time by only needing to look at the single row that corresponds to that time period. So, for the previous example, you want this:

Start Time Stop Time Drunkenness Programming Ability
01 09 0100 NULL
10 20 0100 0450
21 39 0100 NULL
40 50 0100 0250
51 60 0100 NULL

To see that this is what you really want, look at the times in the original rows. Until time 10, only "Dunkenness=0100" is given, so our first row in the result must span from 01 to 09 and contain only Drunkenness info. The next row in the original table spans from 10 to 20, so we must have a row for that time period in the result and it must contain any information that is true at that time (i.e. the "Drunkenness=0100" that is always true and the "Programming Ability = 0450" that is true only between times 10 and 20). As "Programming Ability" is left undefined from time 21 to 39, we must have yet another row where that is NULL. The other two rows are then generated by the same process as the previous rows, so we get the table above.

Is this really a "gaps and islands" problem? Or does the literature give it a different name? I agree that there are gaps in the first dataset and that the results in the final dataset are split in to islands, but that doesn't seem to be what the literature is referring to when it talk of "gaps and islands" problems. The literature seems to care about finding gaps or finding islands, rather than turning gaps in to islands and merging the data like this.

The SQL tag is used because this is a relational database. I am not asking for solutions and I doubt that including an SQL solution in your answer would be enlightening, although they would be welcome. I have therefore not included any SQL code in this question.

I do not believe this question to be opinion-based. I have seen enough coverage of gaps and islands problems to believe that there must be a formal definition of them somewhere. Answers are highly encouraged to provide a formal definition for these problems and a source for it. If this in not a gaps and islands problem, but is actually something else, then please give a name and sourced definition for that.


Solution

  • The condition if one is not NULL then the others are means that your rows are just a different representation of key, value pair. In other words, it un-pivoted variant would look like the following

    Key Value Start End
    Drunkenness 100 01 60
    Programming Ability 450 10 20
    Programming Ability 250 40 50

    Assume that it passes the data integrity checks, that is, there are no overlapping intervals with different value for the same key. Then it looks like a type-2 slowly changing dimension and indeed we can interpret absence of value for Programming Ability between 20 and 40 (exclusive) as NULL.

    However, one can also interpret that data as two separate tables, Drunkenness and Programming Ability merged (via a full join) by start and end date of the intervals.

    SELECT coalesce(a.start,b.start) as start, coalesce(a.end,b.end) as end,
    a.Value, b.Value 
    from a full join b on a.start=b.start and a.end = b.end
    

    So, for example, b is missing data for (10,60) and you get NULL for Programming Ability in the first row there. You can get your second table if you properly join these two table accounting for time interval overlaps.

    SELECT greatest(a.start,b.start) as start, least(a.end,b.end) as end,
    a.Value, b.Value 
    from a full join b on a.start <= b.end and b.start <= a.end
    

    Either way, it is not quite Gaps and Islands problem. In that problem, data has some overlapping intervals possibly with gaps and one has to determine non-overlapping intervals of continuity separated by gaps of discontinuity.