Extracting City and Zipcode from String in PHP


I need a fast generic way in PHP to extract the City and Zipcode (when available) information from an input string.

The string can be of the following forms

  1. $input_str = "123 Main Street, New Haven, CT";
  2. $input_str = "123 Main Street, New Haven, CT 06510";
  3. $input_str = "New Haven, CT, United States";
  4. $input_str = "New Haven, CT 06510";

I was thinking that for (1) and (3) at least, I could explode the input string on "," and then loop through the array to find a 2-letter state code and ignore it. But I'm stuck beyond this point.

$search_values = explode(',' ,$input_str);
foreach($search_values as $search)
{
    $trim_search = trim($search);   // Remove leading and trailing whitespace

    // If the 2-letter state is provided without a zip code, ignore it
    if (strlen($trim_search) == 2)
    {
        //echo 'Ignoring State Without Zipcode: ' . $search . '<br>';
        continue;
    }

    ...

Solution

  • I'm not the greatest with regex, but here's a shot at finding a 2-character state with or without a zip code.

    Regex: (([A-Z]{2})|[0-9]{5})+


    However, if you want to match only when the state AND zip code are together, take a look at this: Regex: (([A-Z]{2})(\s*[0-9]{5}))+

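As a minimal sketch of how the second pattern could be used from PHP: the function name `findStateZip` is made up for illustration, and the `\b` word boundaries are an assumption added here so the pattern does not match inside longer words or numbers.

```php
<?php
// Hypothetical helper: returns ['state' => ..., 'zip' => ...] or null.
function findStateZip(string $input): ?array
{
    // Two capital letters, optional whitespace, then exactly five digits.
    // The \b anchors are an addition to the pattern given in the answer.
    if (preg_match('/\b([A-Z]{2})\s*([0-9]{5})\b/', $input, $m)) {
        return ['state' => $m[1], 'zip' => $m[2]];
    }
    return null; // no state + zip pair found
}

$withZip    = findStateZip("123 Main Street, New Haven, CT 06510");
$withoutZip = findStateZip("New Haven, CT, United States");
```

With the four input forms from the question, forms (2) and (4) yield a state and a zip code, while forms (1) and (3) return null.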

    class Extract  {
        
        private $_string;
        
        private $_sections = array();
        
        private $_output = array();
            
        private $_found = array();
        
        private $_original_string;
        
        private $_countries = array (
            'United States',
            'Canada',
            'Mexico',
            'France',
            'Belgium',
            'United Kingdom',
            'Sweden',
            'Denmark',
            'Spain',
            'Australia',
            'Austria',
            'Italy',
            'Netherlands'
        );
        
        private $_zipcon = array();
        
        private $ZIPREG = array(
            "United States"=>"^\d{5}([\-]?\d{4})?$",
            "United Kingdom"=>"^(GIR|[A-Z]\d[A-Z\d]??|[A-Z]{2}\d[A-Z\d]??)[ ]??(\d[A-Z]{2})$",
            "Germany"=>"\b((?:0[1-46-9]\d{3})|(?:[1-357-9]\d{4})|(?:[4][0-24-9]\d{3})|(?:[6][013-9]\d{3}))\b",
            "Canada"=>"^([ABCEGHJKLMNPRSTVXY]\d[ABCEGHJKLMNPRSTVWXYZ])\s*(\d[ABCEGHJKLMNPRSTVWXYZ]\d)$",
            "France"=>"^(F-)?((2[A|B])|[0-9]{2})[0-9]{3}$",
            "Italy"=>"^(V-|I-)?[0-9]{5}$",
            "Australia"=>"^(0[289][0-9]{2})|([1345689][0-9]{3})|(2[0-8][0-9]{2})|(290[0-9])|(291[0-4])|(7[0-4][0-9]{2})|(7[8-9][0-9]{2})$",
            "Netherlands"=>"^[1-9][0-9]{3}\s?([a-zA-Z]{2})?$",
            "Spain"=>"^([1-9]{2}|[0-9][1-9]|[1-9][0-9])[0-9]{3}$",
            "Denmark"=>"^([D-d][K-k])?( |-)?[1-9]{1}[0-9]{3}$",
            "Sweden"=>"^(s-|S-){0,1}[0-9]{3}\s?[0-9]{2}$",
            "Belgium"=>"^[1-9]{1}[0-9]{3}$"
        ); // thanks to http://www.pixelenvision.com/1708/zip-postal-code-validation-regex-php-code-for-12-countries/
        
        public function __construct($string) {
    
            $this->_output = array (
            
                "state" => "",
                "city" => "",
                "country" => "",
                "zip" => "",
                "street" =>"",
                "number" => ""
            );
            $this->_original_string = $string;
            $this->_string = $this->normalize(trim($string));
            
            
            // create an array of patterns in order to extract zip code using the country list we already have
            foreach($this->ZIPREG as $country => $pattern) {
                $this->_zipcon[] = $pattern = preg_replace( array("/\^/","/\\$/"),array("",""), $pattern);
            }
        
            $this->init();
    
        }
        
        protected function init() {
            
            $this->getData(); // get data that can be found without breaking up the string.
            
            $this->_sections = array_filter(explode(',', trim($this->_string)));  // split each section
    
            if(!empty($this->_sections)) {
                foreach($this->_sections as $i => $d) {
                    $d = preg_replace(array("/\s+/", "/\s([?.!])/"),  array(" ","$1"), $d ); 
                    $this->_sections[$i] = trim($this->normalize($d));  // normalize string to have a single space between words
                }
            } else {
                $this->_sections[] = $this->_string;    
            }       
            
            // try to match what's missing with what has already been found
            $notFound = $this->getNotFound();
            if(count($notFound)==1 && count($this->_found)>1) {
                $found = $this->getFound();
                foreach($found as $string) {
                    // quote the found text so characters such as "." are matched literally
                    $notFound[0] = preg_replace("/" . preg_quote($string, "/") . "/i", "", $notFound[0]);
                }
                $this->_output["city"] = $notFound[0];
                $this->_found[] = $this->_output["city"];
                $this->remove($this->_output["city"]);
            }   
        }
        
        public function getSections() {
            return $this->_sections;
        }   
        
        protected function normalize($string) {
            $string = preg_replace(array("/\s+/", "/\s([?.!])/"),  array(" ","$1"), trim($string));
            return $string;
        }
        
        protected function country_from_zip($zip) {
            $found = "";
            foreach($this->ZIPREG as $country => $pattern) {
                if(preg_match ("/".$pattern."/", $zip)) {
                    $found = $country;
                    break;
                }
            }
            return $found;
        }
        
        protected function getData() {
            $container = array();
            // extract zip code only when present beside state, or else five digits are meaningless
            
            // a single call both tests for a match and fills $container["state_zip"]
            if(preg_match ("/[A-Z]{2,}\s*(".implode('|', $this->_zipcon).")/", $this->_string, $container["state_zip"]) ){
    
                $this->_output["state"] = $container["state_zip"][0];
                $this->_output["zip"] = $container["state_zip"][1];
                $this->_found[] = $this->_output["state"] . " ". $this->_output["zip"];
                // remove from string once found
                $this->remove($this->_output["zip"]);   
                $this->remove($this->_output["state"]);
                
                // check to see if we can find the country just by inputting zip code
                if($this->_output["zip"]!="" ) {
                    $country = $this->country_from_zip($this->_output["zip"]);
                    $this->_output["country"] = $country;
                    $this->_found[] = $this->_output["country"];
                    $this->remove($this->_output["country"]);
                }
            } 
            
            if(preg_match ("/\b([A-Z]{2,})\b/", $this->_string, $container["state"])) {
                $this->_output["state"] = $container["state"][0];
                $this->_found[] = $this->_output['state'];
                $this->remove($this->_output["state"]);
            }
    
            // if we weren't able to find a country based on the zip code, use the one provided (if provided)
            if($this->_output["country"] == "" && preg_match("/(". implode('|',$this->_countries)  . ")/i", $this->_string, $container["country"]) ){
                $this->_output["country"] = $container["country"][0];
                $this->_found[] = $this->_output['country'];
                $this->remove($this->_output["country"]);
            }   
                
            if(preg_match ("/([0-9]{1,})\s+([.\\-a-zA-Z\s*]{1,})/", $this->_string, $container["address"]) ){
                $this->_output["number"] = $container["address"][1];
                $this->_output["street"] = $container["address"][2];
                $this->_found[] = $this->_output["number"] . " ". $this->_output["street"];
                $this->remove($this->_output["number"]);
                $this->remove($this->_output["street"]);
            }       
            
            
            //echo $this->_string;
        }
        
        /* remove from string in order to make it easier to find what is still missing */
        protected function remove($string, $case_sensitive = false) {
            $s = ($case_sensitive==false ? "i" : "");
            // escape regex metacharacters (and the "/" delimiter) so $string is matched literally
            $this->_string = preg_replace("/".preg_quote($string, "/")."/$s", "", $this->_string);
        }
    
        public function getNotFound() {
            return array_values(array_filter(array_diff($this->_sections, $this->_found)));
        }
        
        public function getFound() {
            return $this->_found;   
        }
    
        /* outputs a readable string with all items found */
        public function toString() {
            $output = $this->getOutput();
            $string = "Original string: [ ".$this->_original_string.' ] ---- New string: [ '. $this->_string. ' ]<br>';
            foreach($output as $type => $data) {
                $string .= "-".$type . ": " . $data. '<br>';    
            }   
            return $string;
        }
        
        /* return the final output as an array */
        public function getOutput() {
            return $this->_output;  
        }   
        
    }
    
    
    
    $array = array();
    $array[0] = "123 Main Street, New Haven, CT 06518";
    $array[1] = "123 Main Street, New Haven, CT";
    $array[2] = "123 Main Street, New Haven,                            CT 06511";
    $array[3] = "New Haven,CT 66554, United States";
    $array[4] = "New Haven, CT06513";
    $array[5] = "06513";
    $array[6] = "123 Main    Street, New Haven CT 06518, united states";
    
    $array[7] = "1253 McGill College, Montreal, QC H3B 2Y5"; // google Montreal  / Canada
    $array[8] = "1600 Amphitheatre Parkway, Mountain View, CA 94043"; // google CA  / US
    $array[9] = "20 West Kinzie St., Chicago, IL 60654"; // google IL / US
    $array[10] = "405 Rue Sainte-Catherine Est, Montreal, QC"; // Montreal address shows hyphened street names
    $array[11] = "48 Pirrama Road, Pyrmont, NSW 2009"; // google Australia
    
    
    foreach($array as $string) {
        $a = new Extract($string);
    
        echo $a->toString().'<br>'; 
    }
    

    Running the examples from the code above should output:

    Original string: [ 123 Main Street, New Haven, CT 06518 ] ---- New string: [ , , ]
    -state: CT
    -city: New Haven
    -country: United States
    -zip: 06518
    -street: Main Street
    -number: 123
    
    Original string: [ 123 Main Street, New Haven, CT ] ---- New string: [ , , ]
    -state: CT
    -city: New Haven
    -country: 
    -zip: 
    -street: Main Street
    -number: 123
    
    Original string: [ 123 Main Street, New Haven, CT 06511 ] ---- New string: [ , , ]
    -state: CT
    -city: New Haven
    -country: United States
    -zip: 06511
    -street: Main Street
    -number: 123
    
    Original string: [ New Haven,CT 66554, United States ] ---- New string: [ , , ]
    -state: CT
    -city: New Haven
    -country: United States
    -zip: 66554
    -street: 
    -number: 
    
    Original string: [ New Haven, CT06513 ] ---- New string: [ , ]
    -state: CT
    -city: New Haven
    -country: United States
    -zip: 06513
    -street: 
    -number: 
    
    Original string: [ 06513 ] ---- New string: [ 06513 ]
    -state: 
    -city: 
    -country: 
    -zip: 
    -street: 
    -number: 
    
    Original string: [ 123 Main Street, New Haven CT 06518, united states ] ---- New string: [ , , ]
    -state: CT
    -city: New Haven
    -country: United States
    -zip: 06518
    -street: Main Street
    -number: 123
    
    Original string: [ 1253 McGill College, Montreal, QC H3B 2Y5 ] ---- New string: [ , , ]
    -state: QC
    -city: Montreal
    -country: Canada
    -zip: H3B 2Y5
    -street: McGill College
    -number: 1253
    
    Original string: [ 1600 Amphitheatre Parkway, Mountain View, CA 94043 ] ---- New string: [ , , ]
    -state: CA
    -city: Mountain View
    -country: United States
    -zip: 94043
    -street: Amphitheatre Parkway
    -number: 1600
    
    Original string: [ 20 West Kinzie St., Chicago, IL 60654 ] ---- New string: [ , , ]
    -state: IL
    -city: Chicago
    -country: United States
    -zip: 60654
    -street: West Kinzie St.
    -number: 20
    
    Original string: [ 405 Rue Sainte-Catherine Est, Montreal, QC ] ---- New string: [ , , ]
    -state: QC
    -city: Montreal
    -country: 
    -zip: 
    -street: Rue Sainte-Catherine Est
    -number: 405
    
    Original string: [ 48 Pirrama Road, Pyrmont, NSW 2009 ] ---- New string: [ , , ]
    -state: NSW
    -city: Pyrmont
    -country: Australia
    -zip: 2009
    -street: Pirrama Road
    -number: 48
    

    To use the extracted values in your own code, call getOutput(). It returns an array with all the values found. For the first address in the list, it should output:

    Array
    (
        [state] => CT
        [city] => New Haven
        [country] => United States
        [zip] => 06518
        [street] => Main Street
        [number] => 123
    )
    

    Please note that this class can be greatly optimized and improved. This is what I came up with within the hour, so I cannot guarantee it will work for all types of input. In essence, you must make sure the user at least makes an effort to separate the parts of the address with commas. You also need a capitalized state and a valid five-digit zip code.
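Whenever text taken from the input is placed inside a pattern, as remove() does, regex metacharacters in that text need to be escaped. One possible hardening, offered as a sketch rather than as part of the original class, is a small helper built on PHP's preg_quote(). The name `removeLiteral` is hypothetical.

```php
<?php
// Hypothetical helper: remove a literal piece of text, case-insensitively.
function removeLiteral(string $haystack, string $needle): string
{
    if ($needle === '') {
        return $haystack; // nothing to remove
    }
    // preg_quote() escapes regex metacharacters; the second argument
    // also escapes the "/" delimiter used by the pattern below.
    return preg_replace('/' . preg_quote($needle, '/') . '/i', '', $haystack);
}

$rest = removeLiteral("20 West Kinzie St., Chicago", "West Kinzie St.");
```

Without the quoting, the "." in "St." would match any character, and a "/" in the input would end the pattern early.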

    A few rules

    1. In order to extract a zip code, a capitalized state abbreviation (two or more letters, e.g. CT or NSW) must be provided with a valid zip code beside it. Example: CT 06510. Without the state, five digits on their own are meaningless, since the street number can also have five digits and the code cannot tell the two apart.

    2. Street and number can only be extracted if a number is followed directly by one or more words. Example: 123 Main Street. The street must also be followed by a comma, or the pattern will capture all the words after the number. For example, given 123 Main Street New Haven, CT 06518, the code will assume that the street and number are 123 Main Street New Haven rather than 123 Main Street.

    3. Simply inputting a five digit zip code will not work.

    4. If a country is not given, it will guess the country provided that there is a valid zip code (see list of zip-codes and their respective countries above).

    5. It assumes that no hyphens will be provided (especially for city names). This can be modified later on. (Regex needs to be modified to accommodate hyphenated words for both city and street names). (fixed)

    6. The bottom line is that you can do a lot more if you have some time to change and modify the regular expressions and customize this accordingly.
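Rules 2 and 4 can be illustrated in isolation. The following is a minimal sketch, not an excerpt from the class: the names `ZIP_PATTERNS`, `guessCountry` and `splitStreet` are made up for illustration, and only two of the postal-code patterns listed above are included.

```php
<?php
// Subset of the postal-code patterns listed in the class above.
const ZIP_PATTERNS = [
    'United States' => '/^\d{5}([\-]?\d{4})?$/',
    'Canada'        => '/^([ABCEGHJKLMNPRSTVXY]\d[ABCEGHJKLMNPRSTVWXYZ])\s*(\d[ABCEGHJKLMNPRSTVWXYZ]\d)$/',
];

// Rule 4: guess the country from the shape of the postal code.
function guessCountry(string $zip): string
{
    foreach (ZIP_PATTERNS as $country => $pattern) {
        if (preg_match($pattern, $zip)) {
            return $country;
        }
    }
    return ''; // unknown format
}

// Rule 2: a number followed by words. The character class has no comma,
// so the street name ends at the first comma (or other excluded character).
function splitStreet(string $input): ?array
{
    if (preg_match('/(\d+)\s+([.\-a-zA-Z\s]+)/', $input, $m)) {
        return ['number' => $m[1], 'street' => trim($m[2])];
    }
    return null;
}

$country = guessCountry("H3B 2Y5");
$good    = splitStreet("123 Main Street, New Haven, CT 06518");
$greedy  = splitStreet("123 Main Street New Haven, CT 06518");
```

Here `$good` holds the street Main Street, while `$greedy` holds Main Street New Haven, because nothing separates the street from the city. A string without a leading number, such as New Haven, CT 06510, returns null.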

    I would strongly suggest using a form with separate input fields (in case you don't already have one) to capture each part of the address. It'll probably make your life a lot easier.

    Quick use

    $Extract = new Extract("123 Main Street, New Haven, CT 06518");
    $foundValues = $Extract->getOutput();