Search code examples
sqldb2datastage

How to determine existence of overlaps in Zip Code ranges (DataStage 8.5)


Currently I have the following SQL that identifies records where a conflict (overlap) in geographical areas occurs. I have provided below as a CTE to illustrate table contents.

   WITH
    A AS
    (
         SELECT
            AREA_ID,
                        AREA_CATEGORY,
            AREA_NAME,
            START_ZIP,
            END_ZIP
           FROM
            AREA_DEFINITION
    )
    ,
    B AS
    (
         SELECT
            AREA_ID,
                        AREA_CATEGORY,
            AREA_NAME,
            START_ZIP,
            END_ZIP
           FROM
            AREA_DEFINITION
    ) 
 SELECT
    A.AREA_ID,
    A.AREA_NAME,
        A.AREA_CATEGORY,
    A.START_ZIP      AS A_START_ZIP,
    A.END_ZIP        AS A_END_ZIP,
    B.START_ZIP      AS A_START_ZIP,
    B.END_ZIP        AS A_END_ZIP
   FROM
    A
CROSS JOIN
    B
  WHERE
    A.AREA_ID != B.AREA_ID
    AND A.AREA_NAME = B.AREA_NAME
    AND A.AREA_CATEGORY = B.AREA_CATEGORY
    AND((
            INT(A.START_ZIP) BETWEEN INT(B.START_ZIP) AND
            INT(B.END_ZIP))
        OR(
            INT(B.START_ZIP) BETWEEN INT(A.START_ZIP) AND
            INT(A.END_ZIP)))

When the geographical regions are provided in an input Sequential File (CSV) file, I'd get a file that looks something like this:

AREA_NAME, AREA_CATEGORY, AREA_START_ZIP, AREA_END_ZIP
Domestic, Sales, 00000, 99999
South, Wholesale, 75000, 85000
East, HQ, 00200, 00210 

I would like to reject or fail the job when an overlap exists and am unsure how to emulate a cross join in DataStage.


Solution

  • The following is an overview of the job developed to solve this issue.

    Initially the job transforms (stage 1) the incoming sequential file (stage 0) Zip Codes to integers, sorts (stage 2) based on Category then Zip Code, then a second transformer (stage 3) to add a row ID to each record.

    This gets me to a point where I have a working data set (DS) of the following:

    KEY, AREA_NAME, AREA_CATEGORY, AREA_START_ZIP, AREA_END_ZIP
    

    From here, using the observation from Clockwork-Muse I could switch in a JOIN (stage 4) to join the working data set to itself, with an A and B alias on KEY.

    Transforming the data set (DS) to the following:

    KEY, AREA_NAME, A_AREA_CATEGORY, A_AREA_START_ZIP, A_AREA_END_ZIP, B_AREA_CATEGORY, B_AREA_START_ZIP, B_AREA_END_ZIP
    

    From here, the final processing stage is an additional transformer that will add an indicator for the overlap based on the following condition:

    If ((DS.A_AREA_START_ZIP <= DS.B_AREA_END_ZIP) AND (DS.B_AREA_START_ZIP <= DS.A_AREA_END_ZIP) AND DS.A_AREA_CATEGORY = DS.B_AREA_CATEGORY) Then 'Y' Else 'N'
    

    Leaving us with a data set (DS) with the following information for further processing:

    KEY, AREA_NAME, A_AREA_CATEGORY, A_AREA_START_ZIP, A_AREA_END_ZIP, B_AREA_CATEGORY, B_AREA_START_ZIP, B_AREA_END_ZIP, OVERLAP