I have a need to take a bounded random user sample that is uniformly distributed across the US geography.
I am wondering if the following approach is reasonable to achieve my goal in Oracle 11g, given that I have zip-code information about the users.
My approach is to use zip-codes to accomplish the geographic distribution across the US.
To ensure the selected zip-codes are uniformly distributed across the US: - I am using the "zone" (first two digits).
To get a random distribution across the zip-code zone:
Lastly, (not included below) I would retrieve a set of random set of users from each zip-code.
select
y.ZIP_CODE
from
(
select
x.ZIP_CODE
,x.ZIP_CODE_ZONE
,row_number()
over (
partition by x.ZIP_CODE_ZONE
order by x.random_num
)
AS row_number
from
(
select
ZIP_CODE
,ZIP_CODE_ZONE
,dbms_random.value(0,100000) as random_num
from
ZIP_CODE_TABLE
where 1=1
and COUNTRY_CODE = 'US'
) x
) y
where
y.row_number <= 10
order by
y.zip_code
,y.row_number
;
Update:
After more research, I found that this approach will not fulfill the goal of getting a "uniformly" distributed set of information across the US. The result is still biased because zip-code zones are "small" near "large" population centers. So the chance of retrieving a user in these small zones is larger than the larger zones, so users cluster together near population centers.
Data from the database is retrieved in unsorted manner. You can get the same results if you have already some data in buffer cache.
select y.ZIP_CODE
from
(
select
ZIP_CODE
, ZIP_CODE_ZONE
, row_number() over (partition by ZIP_CODE_ZONE order by dbms_random.value(0,100000)) AS row_number
from ZIP_CODE_TABLE
where COUNTRY_CODE = 'US'
) y
where y.row_number <= 10
order by
y.zip_code
, y.row_number
;