I have a CSV file with about 20 million rows that I'd like to use in my web application. The data is a mapping of postal/zip codes to actual street addresses in the following format:
[zip_or_postal_code] [street_number] [street_name] [city] [state_or_province] [country]
My goal is to keep my lookups (searching by zip/postal code) under 200ms.
I'm not sure if this would make a difference, but I was planning on doing the following:
state/province
, country
, and city
columns to their own tables and reference those in my primary table in order to avoid unnecessary bloat.What are some optimizations I could make to help with lookup speed? As an example, Google's reverse geolocation API returns a result in under 300 ms with HTTP overhead included. How do they do it?
Also, I am open to using other databases, but since I'm already using MySQL, that would be preferable.
Edit: The lookups will always be done by zip/postal code, so as an example: given the zip 12345 I'd need to return the street #(s)/name(s), city, state, and country. The street #(s)/name(s) will be stored as a single string field, however, so my app will take care of parsing them.
20 million rows is not a lot for MySQL. Just index the zip/postal code and it will be fast. Way under 200ms fast. No need to split between tables. MySQL does get slow when the result set is large, but it doesn't seem like you would encounter that issue. MySQL will do just fine with hundreds of millions of records for basic queries like yours.
You will need to adjust the MySQL settings so that it uses more memory. The default settings are pretty low.
MySQL does support spacial indexes. So you could pull the longitude/latitude for the postal codes and use a spacial index to do proximity searches. Doesn't seem like you are looking for that though.
If you want things really, really fast, go the route you were thinking of but use memcache or redis. You can use the zip/postal code as the lookup key. You would still need a persistant disk based data store to load the data from. I don't think memcache/redis is necessary, but it's an option.