Search code examples
oracle-databaserandomzipcodeuniform

Is this the best way to get a bounded, random result, uniformly distributed over a column value?


I have a need to take a bounded random user sample that is uniformly distributed across the US geography.

I am wondering if the following approach is reasonable to achieve my goal in Oracle 11g, given that I have zip-code information about the users.

My approach is to use zip-codes to accomplish the geographic distribution across the US.

To ensure the selected zip-codes are uniformly distributed across the US: - I am using the "zone" (first two digits).

To get a random distribution across the zip-code zone:

  1. For each zip-code I assign a random number.
  2. I then sort the groups of zip-code zones by the random number and select the top N (10 in the following example).

Lastly, (not included below) I would retrieve a set of random set of users from each zip-code.

select
  y.ZIP_CODE
from
  (
  select
    x.ZIP_CODE
   ,x.ZIP_CODE_ZONE
   ,row_number()
    over (
      partition by x.ZIP_CODE_ZONE
      order by x.random_num
    )
     AS row_number
  from
    (
    select 
      ZIP_CODE
     ,ZIP_CODE_ZONE
     ,dbms_random.value(0,100000) as random_num
    from
      ZIP_CODE_TABLE
    where 1=1
      and COUNTRY_CODE = 'US'
    ) x
  ) y
where
  y.row_number <= 10
order by 
  y.zip_code
 ,y.row_number
;

Update:

After more research, I found that this approach will not fulfill the goal of getting a "uniformly" distributed set of information across the US. The result is still biased because zip-code zones are "small" near "large" population centers. So the chance of retrieving a user in these small zones is larger than the larger zones, so users cluster together near population centers.


Solution

  • Data from the database is retrieved in unsorted manner. You can get the same results if you have already some data in buffer cache.

    select y.ZIP_CODE
    from
    (
        select 
              ZIP_CODE
            , ZIP_CODE_ZONE
            , row_number() over (partition by ZIP_CODE_ZONE order by dbms_random.value(0,100000)) AS row_number
        from  ZIP_CODE_TABLE
        where COUNTRY_CODE = 'US'
    ) y
    where y.row_number <= 10
    order by 
          y.zip_code
        , y.row_number
    ;