Search code examples
mysqldatabaseautosuggeststreet-address

What is the best way to work with a huge address database for autosuggest?


I have a database of about 300,000,000 addresses for a country. This data is supposed to be used for auto-suggest in a form, i.e: if the address is 1900 Country Hill Road, user would type Count (assuming at least 3 characters to trigger auto-suggest is good idea), and the form may offer user with the first 10 results, i.e:

165 Countable Road
1890 Countable Road
44 Counting Circle
900 Country Drive
13 Country Hill Road
1900 Country Hill Road
344 Country Mountain Place
etc.

Choosing address also fills in the City and ZIP. Note: This is in a country that format addresses as Street ##/##, so the actual name is the first thing to be typed.

Doing a SELECT * FROM addresses WHERE street LIKE 'Count%' LIMIT 10 query takes about 3 seconds on average. Selecting just the street, house_number, city, zip is slightly faster. Each row has 19 columns. I have added an index on one of the columns based on some MySQL guides, but that didn't help noticably. I want to get under 500ms per request.

What would be the best way to resolve this? Things that have come to mind are:

  • dividing the database table into multiple tables based on the streets' first letters and selecting the table based on what the user types. This would mean I would end up with about 26 tables.
  • converting everything into multiple JSON files
  • further looking into optimizing the database

Solution

  • Let me step through the analysis.

    1. You are searching for street address that starts with the given string, correct?
    2. WHERE street LIKE 'Count%', together with INDEX(street) is reasonably optimal. No separate table, no JSON, no initial letter, etc.
    3. LIMIT 10 will find the first 10 such items efficiently. Note: Adding an ORDER BY could wreck performance.
    4. SELECT street finds the result in the index's B+Tree. Since this is all you need, use it. SELECT <<some columns>> or SELECT * has to do an extra action to get the other column(s).

    Such an index will drill down the B+Tree to the first occurrence of "Count...", then it will scan forward from there. Even if the dataset cannot be cached in RAM, there would be effectively only one disk hit, which would take millisecond(s). If you need extra columns, then the extra action would probably require 10 more disk hits (due to LIMIT 10).

    So, it is important that the INDEX contain exactly what you need to receive for the autosuggest.

    ORDER BY would be OK only if you say ORDER BY street (or ORDER BY street DESC). Otherwise, it must find the thousands (millions?) of streets starting with "count", sort this list, then deliver 10.

    Based on that analysis, the "minimum of 3 characters" is not important. However, my UI gut feeling is that 3 is reasonable.

    Oh, another issue. Since there is probably "1 Main St" in thousands of towns, auto select for "Main" is likely to show 10 copies of the same "1 Main St".

    Oh, I have not handled something that you imply from you example. street includes the street number. Or maybe you have two columns -- street_name (useful for searching) and street_number. So...

    Have two columns:

    • One for autosuggest ("Main St")
    • One for display ("1 Main St, Acron OH")

    Then have INDEX(auto_suggest, display). This will give you the efficiency of minimal effort to get the autosuggest values, but allows equally efficient display of the full address.

    But that still has the problem of dup addresses. OK, I am now forced to recommend an extra table. It would contain all the unique street names, no street_number, etc.

    CREATE TABLE AutoSuggest (
        street VARCHAR(99) NOT NULL,
        PRIMARY KEY(street)
    ) ENGINE=InnoDB;
    

    Note that, in MySQL, a PRIMARY KEY is a B+Tree index clustered with the data and is UNIQUE. This table might be about 1GB for 200M addresses, after compensating for dups, overhead, etc. After your app has run for some time, most of the table would be cached in the buffer_pool. This makes the typical autosuggest lookup be about 1 millisecond.

    But, that only give you the street name, not the address, etc. Perhaps you want a second column called full_street (and change to PRIMARY KEY(street, full_street)):

    INSERT INTO AutoSuggest (street, full_street)
        VALUES
        ("Main St", "1 Main St");
    

    This provides most of the benefits and few of the drawbacks mentioned before.

    Ponder my comments with the idea of tradeoffs between your two goals:

    • Fast autosuggest;
    • What the UI presents during the autosuggest.