Search code examples
sqlsortingpostal-code

Sort Postcode for menu/list


I need to sort a list of UK postcodes in to order.

Is there a simple way to do it?

UK postcodes are made up of letters and numbers:

see for full info of the format: http://en.wikipedia.org/wiki/UK_postcodes

But my problem is this a simple alpha sort doesn't work because each code starts with 1 or two letters letters and then is immediately followed by a number , up to two digits, then a space another number then a letter. e.g. LS1 1AA or ls28 1AA, there is also another case where once the numbers in the first section exceed 99 then it continues 9A etc.

Alpha sort cause the 10s to immediately follow the 1:

...
LS1 9ZZ
LS10 1AA
...
LS2 

I'm looking at creating a SQL function to convert the printable Postcode into a sortable postcode e.g. 'LS1 9ZZ' would become 'LS01 9ZZ', then use this function in the order by clause.

Has anybody done this or anything similar already?


Solution

  • You need to think of this as a tokenization issue so SW1A 1AA should tokenize to:

    • SW
    • 1
    • A
    • 1AA

    (although you could break the inward part down into 1 and AA if you wanted to)

    and G12 8QT should tokenize to:

    • G
    • 12
    • (empty string)
    • 8QT

    Once you have broken the postcode down into those component parts then sorting should be easy enough. There is an exception with the GIR 0AA postcode but you can just hardcode a test for that one

    edit: some more thoughts on tokenization

    For the sample postcode SW1A 1AA, SW is the postcode area, 1A is the postcode district (which we'll break into two parts for sorting purposes), 1 is the postcode sector and AA is the unit postcode.

    These are the valid postcode formats (source: Royal Mail PAF user guide page 8 - link at bottom of this page):

    AN NAA
    AAN NAA
    ANN NAA
    ANA NAA
    AAA NAA (only for GIR 0AA code)
    AANN NAA
    AANA NAA

    So a rough algorithm would be (assuming we want to separate the sector and unit postcode):

    • code = GIR 0AA? Tokenize to GI/R/ /0/AA (treating R as the district simplifies things)
    • code 5 letters long e.g G1 3AF? Tokenize to G/1/ /3/AF
    • code 6 letters long with 3rd character being a letter e.g. W1P 1HQ? Tokenize to W/1/P/1/HQ
    • code 6 letters long with 2nd character being a letter e.g. CR2 6XH? Tokenize to CR/2/ /6/XH
    • code 7 letters long with 4th character being a letter e.g. EC1A 1BB? Tokenize to EC/1/A/1/BB
    • otherwise e.g. TW14 2ZZ, tokenize to TW/14/ /2/ZZ

    If the purpose is to display a list of postcodes for the user to choose from then I would adopt Neil Butterworth's suggestion of storing a 'sortable' version of the postcode in the database. The easiest way to create a sortable version is to pad all entries to nine characters:

    • two characters for the area (right-pad if shorter)
    • two for the district number (left-pad if shorter)
    • one for the district letter (pad if missing)
    • space
    • one for the sector
    • two for the unit

    and GIR 0AA is a slight exception again. If you pad with spaces then the sort order should be correct. Examples using # to represent a space:

    • W1#1AA => W##1##1AA
    • WC1#1AA => WC#1##1AA
    • W10#1AA => W#10##1AA
    • W1W#1AA => W##1W#1AA
    • GIR#0AA => GI#R##0AA
    • WC10#1AA => WC10##1AA
    • WC1W#1AA => WC#1W#1AA

    You need to right-pad the area if it's too short: left-padding produces the wrong sort order. All of the single letter areas - B, E, G, L, M, N, S, W - would sort before all of the two-letter areas - AB, AL, ..., ZE - if you left-padded

    The district number needs to be left padded to ensure that the natural W1, W2, ..., W9, W10 order remains intact