Currently I have the following SQL that identifies records where a conflict (overlap) in geographical areas occurs. I have provided below as a CTE to illustrate table contents.
WITH
A AS
(
SELECT
AREA_ID,
AREA_CATEGORY,
AREA_NAME,
START_ZIP,
END_ZIP
FROM
AREA_DEFINITION
)
,
B AS
(
SELECT
AREA_ID,
AREA_CATEGORY,
AREA_NAME,
START_ZIP,
END_ZIP
FROM
AREA_DEFINITION
)
SELECT
A.AREA_ID,
A.AREA_NAME,
A.AREA_CATEGORY,
A.START_ZIP AS A_START_ZIP,
A.END_ZIP AS A_END_ZIP,
B.START_ZIP AS A_START_ZIP,
B.END_ZIP AS A_END_ZIP
FROM
A
CROSS JOIN
B
WHERE
A.AREA_ID != B.AREA_ID
AND A.AREA_NAME = B.AREA_NAME
AND A.AREA_CATEGORY = B.AREA_CATEGORY
AND((
INT(A.START_ZIP) BETWEEN INT(B.START_ZIP) AND
INT(B.END_ZIP))
OR(
INT(B.START_ZIP) BETWEEN INT(A.START_ZIP) AND
INT(A.END_ZIP)))
When the geographical regions are provided in an input Sequential File (CSV) file, I'd get a file that looks something like this:
AREA_NAME, AREA_CATEGORY, AREA_START_ZIP, AREA_END_ZIP
Domestic, Sales, 00000, 99999
South, Wholesale, 75000, 85000
East, HQ, 00200, 00210
I would like to reject or fail the job when an overlap exists and am unsure how to emulate a cross join in DataStage.
The following is an overview of the job developed to solve this issue.
Initially the job transforms (stage 1) the incoming sequential file (stage 0) Zip Codes to integers, sorts (stage 2) based on Category then Zip Code, then a second transformer (stage 3) to add a row ID to each record.
This gets me to a point where I have a working data set (DS) of the following:
KEY, AREA_NAME, AREA_CATEGORY, AREA_START_ZIP, AREA_END_ZIP
From here, using the observation from Clockwork-Muse I could switch in a JOIN
(stage 4) to join the working data set to itself, with an A
and B
alias on KEY
.
Transforming the data set (DS) to the following:
KEY, AREA_NAME, A_AREA_CATEGORY, A_AREA_START_ZIP, A_AREA_END_ZIP, B_AREA_CATEGORY, B_AREA_START_ZIP, B_AREA_END_ZIP
From here, the final processing stage is an additional transformer that will add an indicator for the overlap based on the following condition:
If ((DS.A_AREA_START_ZIP <= DS.B_AREA_END_ZIP) AND (DS.B_AREA_START_ZIP <= DS.A_AREA_END_ZIP) AND DS.A_AREA_CATEGORY = DS.B_AREA_CATEGORY) Then 'Y' Else 'N'
Leaving us with a data set (DS) with the following information for further processing:
KEY, AREA_NAME, A_AREA_CATEGORY, A_AREA_START_ZIP, A_AREA_END_ZIP, B_AREA_CATEGORY, B_AREA_START_ZIP, B_AREA_END_ZIP, OVERLAP