
Building a postcode address lookup based on Royal Mail PAF raw data


I am working on custom-built software for postcode lookup based on Royal Mail PAF data. The main purpose of the software is to replace Quick Address (a third-party software vendor).

I have a few questions:

  1. How come the Quick Address data files, including indexes, are under 500 MB, whereas the PAF raw data is over 2.5 GB? What cleanup and compression techniques have they performed on the raw data to achieve that? My imported database size is 2.5 GB (SQLite). I have to use a free/open-source database; a paid database is not an option here.

  2. There are 28 million records. How can I improve searching by organization name or town, for example, considering it can be performed using a "LIKE" statement?
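     For context, the reason `LIKE` searches are slow at this scale is that a pattern with a leading wildcard cannot use an ordinary B-tree index, while a prefix pattern can. A minimal SQLite sketch (the `address` table and `town` column are illustrative, not the PAF schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE address (town TEXT)")
con.execute("CREATE INDEX idx_town ON address(town)")

# A leading wildcard forces a full scan of every row
scan_plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM address WHERE town LIKE '%mouth'"
).fetchone()[-1]

# A prefix pattern can use the index, provided LIKE is case-sensitive
con.execute("PRAGMA case_sensitive_like = ON")
prefix_plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM address WHERE town LIKE 'Plym%'"
).fetchone()[-1]
```

     The first plan reports a table scan, the second an index search, which is why the answer below reaches for a full-text index instead of `LIKE '%word%'`.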

Any ideas?


Solution

  • Don't store information that you don't need, such as DPS, occupancy, and the various company flags.

    Rather than holding 28 million addresses, you could hold 1.8 million addresses, one for each postcode, and have a list of delivery points for each postcode (i.e. house numbers and house/building names).
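    A minimal sketch of that split in SQLite (table and column names here are illustrative, not the actual PAF layout): the shared address parts are stored once per postcode, and each delivery point stores only what varies.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- One row per postcode (~1.8m rows) holds the shared address parts
    CREATE TABLE postcode (
        id       INTEGER PRIMARY KEY,
        postcode TEXT NOT NULL UNIQUE,
        street   TEXT,
        town     TEXT
    );
    -- One row per delivery point (~28m rows) holds only what varies
    CREATE TABLE delivery_point (
        postcode_id   INTEGER NOT NULL REFERENCES postcode(id),
        house_number  TEXT,
        building_name TEXT
    );
    CREATE INDEX dp_postcode ON delivery_point(postcode_id);
""")

con.execute("INSERT INTO postcode VALUES (1, 'PL4 8AA', 'High Street', 'Plymouth')")
con.executemany(
    "INSERT INTO delivery_point VALUES (1, ?, ?)",
    [("1", None), ("2", None), (None, "The Old Mill")],
)

# Expand a postcode back into full addresses with one join
rows = con.execute("""
    SELECT COALESCE(dp.house_number, dp.building_name), p.street, p.town, p.postcode
    FROM delivery_point dp JOIN postcode p ON p.id = dp.postcode_id
    WHERE p.postcode = 'PL4 8AA'
""").fetchall()
```

    The repeated street/town strings are stored once instead of once per delivery point, which is a large part of how the raw file size shrinks.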

    I'm not sure which version of the PAF you have: the relational version with keys, or the expanded version.

    The keyed version will reduce file size, as your addresses are just made up of numbers pointing to lookup tables for locality, street, street end, etc. But using keys in your addresses will not help with searching by organization or town name.

    Views will help format your output address from the keys. Make sure that the database you use has views that can use indexes, otherwise you'll end up table scanning.
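    For example, a view can hide the key-to-lookup-table joins so callers see a formatted address line (the lookup tables below are hypothetical; SQLite expands views at query time, so indexes on the underlying tables are still used):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE locality (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE street   (id INTEGER PRIMARY KEY, name TEXT);
    -- Keyed address rows store small integers instead of repeated strings
    CREATE TABLE address (
        id INTEGER PRIMARY KEY,
        building_number TEXT,
        street_id   INTEGER REFERENCES street(id),
        locality_id INTEGER REFERENCES locality(id),
        postcode TEXT
    );
    CREATE INDEX addr_postcode ON address(postcode);

    -- The view formats the output address from the keys
    CREATE VIEW formatted_address AS
    SELECT a.id,
           a.building_number || ' ' || s.name || ', ' || l.name
               || ' ' || a.postcode AS line
    FROM address a
    JOIN street s   ON s.id = a.street_id
    JOIN locality l ON l.id = a.locality_id;
""")
con.execute("INSERT INTO locality VALUES (1, 'Plymouth')")
con.execute("INSERT INTO street VALUES (1, 'High Street')")
con.execute("INSERT INTO address VALUES (1, '12', 1, 1, 'PL4 8AA')")

line = con.execute(
    "SELECT line FROM formatted_address WHERE id = 1"
).fetchone()[0]
```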

    What I have done in the past is index the PAF using the full-text search engine Sphinx (http://sphinxsearch.com/), which gives you very powerful search (including partial words and fuzzy matching) on whatever words you decide to index. Try all words in the address. The result from Sphinx is a list of keys that can be used to iterate through a SQL result set. The SQL query can be against an address table of keys, which can be used to build the full address from the lookup tables. The Sphinx index build is remarkably fast and produces a reasonably small index.
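    The same workflow — full-text index returns a list of keys, SQL expands the keys into addresses — can be tried without extra infrastructure using SQLite's built-in FTS5 module (assuming your SQLite build includes it; this is not Sphinx, just the same pattern):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Contentless FTS table indexing the searchable words of each address;
# the rowid doubles as the key back into the main address table.
con.executescript("""
    CREATE TABLE address (id INTEGER PRIMARY KEY, full_address TEXT);
    CREATE VIRTUAL TABLE address_fts USING fts5(words, content='');
""")
data = [
    (1, "Acme Widgets Ltd, 12 High Street, Plymouth PL4 8AA"),
    (2, "10 Downing Street, London SW1A 2AA"),
]
for rowid, text in data:
    con.execute("INSERT INTO address VALUES (?, ?)", (rowid, text))
    con.execute("INSERT INTO address_fts(rowid, words) VALUES (?, ?)", (rowid, text))

# Prefix search on any word in the address; the hit list is a list of keys
keys = [r[0] for r in con.execute(
    "SELECT rowid FROM address_fts WHERE address_fts MATCH 'plym*'"
)]
hits = con.execute(
    "SELECT full_address FROM address WHERE id IN (%s)" % ",".join("?" * len(keys)),
    keys,
).fetchall()
```

    Unlike `LIKE '%plym%'`, the `MATCH` query hits the full-text index rather than scanning 28 million rows.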

    MySQL might be a better fit than SQLite for a database of this size.

    Other things to consider: are you doing batch processing or just transactional lookups? Forget Sphinx for batch processing. Also think about frequency of update: if you don't update monthly, you will become hopelessly out of date in a very short time.

    Note: if you have the keyed version of the PAF, there are some horrible rules for formatting addresses and many undocumented exceptions.