python, regex, algorithm

Algorithm to filter the address from a large text


I am trying to extract part of this string. The match should start at the first digit in the string and run all the way to the digits at the end (the ZIP code).

import re

string = "['Today is the open house of 1234 High Drive, Denver, COLORADO 80204; open to the Public "

property_address = re.findall('\d-\d\d\d\d\d', str(string))

print(property_address)

The code above does not work. I'm a bit confused about how to tell regex to start at the first digit it finds and keep going until it finds a five-digit sequence.


Solution

  • You can use:

    import re
    
    s = """
    aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204 aldjfladjfa alsdjflaksjdf 
    aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204 - 1829
    aldjfladjfa alsdjflaksjdf  1234 High Drive, Denver, COLORADO 00204 - 1829
    aldjfladjfa alsdjflaksjdf aldjfladjfa alsdjflaksjdf aldjfladjfa alsdjflaksjdf 
    aldjfladjfa alsdjflaksjdf 1234 High Drive, 3rd, 4th phone number 1391713917 Denver, COLORADO 00204 - 1829 aldfjald
    
    """
    
    p = r'\b[1-9].*[0-9]{5}(?:-[0-9]{4}\b)?'
    
    find_address = re.findall(p, s)
    
    print(find_address)
    
    

    Prints

    ['1234 High Drive, Denver, COLORADO 80204', '1234 High Drive, Denver, COLORADO 80204', '1234 High Drive, Denver, COLORADO 00204', '1234 High Drive, 3rd, 4th phone number 1391713917 Denver, COLORADO 00204']

    Notes

    • A ZIP code is sometimes followed by a hyphen and four more digits, so the pattern should allow for that.

    The pattern \b[1-9].*[0-9]{5}(?:-[0-9]{4}\b)? breaks down as follows:

    • \b is a word boundary.
    • [1-9] assumes that the address starts with a digit from 1 to 9, not 0. To allow a leading 0, use \b[0-9].*[0-9]{5}(?:-[0-9]{4}\b)?.
    • .* greedily matches any characters except a newline, so the match runs to the last five-digit sequence on the line.
    • [0-9]{5} matches five consecutive digits.
    • (?:-[0-9]{4}\b)? is an optional group: if a hyphen and four digits follow the ZIP code, they are included in the match; otherwise they are skipped.
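    As a small sketch of the optional group: it only accepts the extension when the hyphen directly follows the five digits, which is why "80204 - 1829" loses its extension in the output above. Assuming you also want to tolerate spaces around the hyphen (an assumption, not something the question asks for), one option is to allow optional spaces or tabs there. The name p_spaced is purely illustrative.

```python
import re

# Original pattern: the extension must follow the ZIP code immediately.
p = r'\b[1-9].*[0-9]{5}(?:-[0-9]{4}\b)?'

# Hypothetical variant: also allow spaces or tabs around the hyphen.
p_spaced = r'\b[1-9].*[0-9]{5}(?:[ \t]*-[ \t]*[0-9]{4}\b)?'

line = "text 1234 High Drive, Denver, COLORADO 80204 - 1829"

print(re.findall(p, line))
# -> ['1234 High Drive, Denver, COLORADO 80204']
print(re.findall(p_spaced, line))
# -> ['1234 High Drive, Denver, COLORADO 80204 - 1829']
```

    Whether the looser variant is desirable depends on your data: it will also attach any unrelated four-digit number that happens to follow " - ".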

    Edge cases

    • If a single line can contain multiple addresses, use lazy matching (.*?) instead of greedy matching (.*), so that each match stops at the first five-digit sequence rather than the last.

    \b[1-9].*?[0-9]{5}\b(?:-[0-9]{4}\b)?

    Code

    import re
    
    s = """ aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204 aldjfladjfa alsdjflaksjdf 
    aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829
    aldjfladjfa alsdjflaksjdf  1234 High Drive, Denver, COLORADO 00204-1829
    aldjfladjfa alsdjflaksjdf aldjfladjfa alsdjflaksjdf aldjfladjfa alsdjflaksjdf 
    aldjfladjfa alsdjflaksjdf 1234 High Drive, 3rd, 4th phone numbers (391) 871-3912 1-391-871-3912 Denver, COLORADO 00204-1829 aldfjald
    
    aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829 aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829 aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829 aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829 aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204 - 1829aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829 aldjfladjfa alsdjflaksjdf aldjfladjfa alsdjflaksjdf aldjfladjfa alsdjflaksjdf 
    
    """
    
    p = r'\b[1-9].*?[0-9]{5}\b(?:-[0-9]{4}\b)?'
    
    find_address = re.findall(p, s)
    
    print(find_address)
    
    
    

    Prints

    ['1234 High Drive, Denver, COLORADO 80204', '1234 High Drive, Denver, COLORADO 80204-1829', '1234 High Drive, Denver, COLORADO 00204-1829', '1234 High Drive, 3rd, 4th phone numbers (391) 871-3912 1-391-871-3912 Denver, COLORADO 00204-1829', '1234 High Drive, Denver, COLORADO 80204-1829', '1234 High Drive, Denver, COLORADO 80204-1829', '1234 High Drive, Denver, COLORADO 80204-1829', '1234 High Drive, Denver, COLORADO 80204-1829', '1234 High Drive, Denver, COLORADO 80204', '1829aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829']
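    To see the greedy/lazy difference in isolation, here is a minimal sketch with two addresses on one line. The second address is made-up sample data, and the variable names greedy and lazy are only for illustration.

```python
import re

# Two addresses on a single line; the second one is invented for the demo.
line = ("foo 1234 High Drive, Denver, COLORADO 80204-1829 bar "
        "5678 Low Road, Austin, TEXAS 73301 baz")

greedy = r'\b[1-9].*[0-9]{5}\b(?:-[0-9]{4}\b)?'
lazy = r'\b[1-9].*?[0-9]{5}\b(?:-[0-9]{4}\b)?'

# Greedy: .* runs to the LAST five-digit sequence, merging both addresses.
print(re.findall(greedy, line))
# -> ['1234 High Drive, Denver, COLORADO 80204-1829 bar 5678 Low Road, Austin, TEXAS 73301']

# Lazy: .*? stops at the FIRST five-digit sequence, so each address is separate.
print(re.findall(lazy, line))
# -> ['1234 High Drive, Denver, COLORADO 80204-1829', '5678 Low Road, Austin, TEXAS 73301']
```

    Lazy matching is not a complete fix: as the last entry of the output above shows, any stray number (here "1829") can still start a match that swallows the text before the next address.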


    Algorithms

    • If you need 100% accuracy, it's best to write an algorithm for this problem and keep refining it as new edge cases turn up.

    • Patching one or more regexes for every edge case would be tedious. Note that you can still use regex inside your algorithm, if you want.


    • For example, you can create some "known" sets: check whether the candidate contains a state/country that is in your state/country set. You might even be able to do this for cities and ZIP codes.

    • Similarly, for written numbers or other special words in the address (One, Two, Three, etc.), you can use sets.

    • An address usually has a limited length, perhaps 50-100 characters. You can use this in your algorithm.

    • An address does not contain certain characters (e.g., %, $, *), so you can test candidates against those.

    • Addresses share common words and abbreviations (e.g., South, S., North, N., West, W., Avenue, AVE, Ave., Street, St., BLVD, Boulevard, Blvd., Suite, STE, Apt., Apartment). You can put these substrings in a set and check whether any of them appear in a candidate string. A hit increases the probability that the string is an address.

    • The list goes on and on...
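    The checks listed above could be combined into a rough score. The following is only a sketch under stated assumptions: the sets are deliberately tiny, and the names STREET_WORDS, STATE_NAMES, FORBIDDEN_CHARS and address_score, as well as the max_len default, are hypothetical choices for illustration.

```python
import re

# Illustrative, deliberately incomplete sets; extend them for real data.
STREET_WORDS = {"street", "st", "avenue", "ave", "blvd", "boulevard",
                "drive", "dr", "lane", "ln", "road", "rd", "suite", "ste",
                "apt", "apartment"}
STATE_NAMES = {"colorado", "co", "texas", "tx"}
FORBIDDEN_CHARS = set("%$*")


def address_score(candidate, max_len=100):
    """Return a rough score; a higher value means more address-like."""
    # Hard rejections: too long, or contains characters addresses never have.
    if len(candidate) > max_len:
        return 0
    if FORBIDDEN_CHARS & set(candidate):
        return 0
    # Lowercase and strip punctuation so "St." and "ST" both become "st".
    words = {re.sub(r'\W', '', w).lower() for w in candidate.split()}
    score = 0
    if words & STREET_WORDS:
        score += 1
    if words & STATE_NAMES:
        score += 1
    if re.search(r'\b[0-9]{5}(?:-[0-9]{4})?\b', candidate):
        score += 1
    return score


print(address_score("1234 High Drive, Denver, COLORADO 80204"))  # -> 3
print(address_score("1234 widgets at $5 each, order 80204"))     # -> 0
```

    A score like this is easier to tune than a single large regex: each rule can be adjusted or weighted separately as new edge cases appear.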


    An algorithm is the better choice here: it is easier to understand and maintain.

    Example Algorithm:

    
    import re
    
    s = """aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204 aldjfladjfa alsdjflaksjdf
    aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829
    aldjfladjfa alsdjflaksjdf  1234 High Drive, Denver, COLORADO 00204-1829
    aldjfladjfa alsdjflaksjdf aldjfladjfa alsdjflaksjdf aldjfladjfa alsdjflaksjdf
    aldjfladjfa alsdjflaksjdf 1234 High Drive, 3rd, 4th phone numbers (391) 871-3912 1-391-871-3912 Denver, COLORADO 00204-1829 aldfjald
    
    aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829 aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829 aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829 aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829 aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829 aldjfladjfa alsdjflaksjdf 1234 High Drive, Denver, COLORADO 80204-1829 aldjfladjfa alsdjflaksjdf aldjfladjfa alsdjflaksjdf aldjfladjfa alsdjflaksjdf
    
    wer and Lender covenant and agree as follows:1. Property in Trust. Borrower, in consideration of the indebtedness herein recited and the trust herein created, her - 6620 FAWN PATH LANE, CASTLE PINES, COLORADO 80108
    and the Public Trustee of the County in whichthe Property(see § 1) is situated(Trustee)
    for the benefit of THE EDILEEN M. BIRNBAUM _TESTAMENTARY TRUST(Lender), whose address is 3617 SPRINGBROOK STREET, DALLAS, TEXAS 75205.Borrower and Lender covenant and agree as follows:
        1. Property in Trust. Borrower, in consideration of the indebtedness herein recited and the trust herein created, her -
    """
    
    states = {
        'AL': 'Alabama',
        'AK': 'Alaska',
        'AZ': 'Arizona',
        'AR': 'Arkansas',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'HI': 'Hawaii',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'IA': 'Iowa',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'ME': 'Maine',
        'MD': 'Maryland',
        'MA': 'Massachusetts',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MS': 'Mississippi',
        'MO': 'Missouri',
        'MT': 'Montana',
        'NE': 'Nebraska',
        'NV': 'Nevada',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NY': 'New York',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VT': 'Vermont',
        'VA': 'Virginia',
        'WA': 'Washington',
        'WV': 'West Virginia',
        'WI': 'Wisconsin',
        'WY': 'Wyoming'
    }
    
    patterns = r'\b[1-9].*?[0-9]{5}\b(?:-[0-9]{4}\b)?'
    
    # Lowercased state abbreviations and names, built once for fast lookups.
    state_words = {w.lower() for kv in states.items() for w in kv}
    
    
    def get_possible_addresses(text):
        # Loose first pass: everything from a leading digit to a ZIP code.
        return re.findall(patterns, text)
    
    
    def may_be_an_address(candidate, size=50):
        # Walk backwards from the ZIP code, keeping roughly `size` characters
        # and extending the budget until two plain numbers have been seen.
        words = candidate.split()[::-1]
        check_words = []
        numbers = []
        has_state = False
        for word in words:
            if word.lower() in state_words:
                has_state = True
            if word.isdigit():
                numbers.append(int(word))
            check_words.append(word)
            size -= len(word)
            if len(numbers) >= 2 and size <= 0:
                break
            if size <= 0 and len(numbers) < 2:
                size += 5
        # Without a known state there is nothing to validate; returning early
        # also avoids passing None to the regex.
        if not has_state:
            return None
        possible_address = ' '.join(check_words[::-1])
        # The trimmed string must still match the pattern from start to end.
        return possible_address if re.fullmatch(patterns, possible_address) else None
    
    
    res = []
    
    for candidate in get_possible_addresses(s):
        possible_address = may_be_an_address(candidate, 50)
        if possible_address is not None:
            res.append(possible_address)
    print(res)
    
    

    Prints

    ['1234 High Drive, Denver, COLORADO 80204', '1234 High Drive, Denver, COLORADO 80204-1829', '1234 High Drive, Denver, COLORADO 00204-1829', '1234 High Drive, 3rd, 4th phone numbers (391) 871-3912 1-391-871-3912 Denver, COLORADO 00204-1829', '1234 High Drive, Denver, COLORADO 80204-1829', '1234 High Drive, Denver, COLORADO 80204-1829', '1234 High Drive, Denver, COLORADO 80204-1829', '1234 High Drive, Denver, COLORADO 80204-1829', '1234 High Drive, Denver, COLORADO 80204-1829', '1234 High Drive, Denver, COLORADO 80204-1829', '3617 SPRINGBROOK STREET, DALLAS, TEXAS 75205']

    Note that a regex is itself just a set of algorithms, written in a shorter form.

    • The above algorithm "works" but is not complete (e.g., it misses the 6620 FAWN PATH LANE address in the sample input, and it does not handle multi-word states such as New Mexico, or countries, if any). You would need to add more functions and modify some of the existing ones. My point is that, for this type of problem, it is better to rely on regex as little as possible.

    • Note that I intentionally kept the regex fairly loose and did not add much complexity to it.
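    One possible way to handle multi-word states such as New Mexico is to search the candidate string for state names directly instead of comparing word by word. This is a minimal sketch, not part of the algorithm above: the states dictionary here is a short excerpt, the sample addresses are invented, and state_re and contains_state are hypothetical names.

```python
import re

# Short excerpt of the states mapping; use the full dictionary in practice.
states = {'CO': 'Colorado', 'NM': 'New Mexico', 'TX': 'Texas', 'NY': 'New York'}

# Longest names first, so "New Mexico" is tried before shorter alternatives.
names = sorted(list(states) + list(states.values()), key=len, reverse=True)
state_re = re.compile(
    r'\b(?:' + '|'.join(re.escape(n) for n in names) + r')\b',
    re.IGNORECASE,
)


def contains_state(candidate):
    """Return True if a state name or abbreviation appears as a whole word."""
    return state_re.search(candidate) is not None


print(contains_state("100 Main Street, Santa Fe, NEW MEXICO 87501"))  # -> True
print(contains_state("1234 High Drive, Denver 80204"))                # -> False
```

    One caveat with the full dictionary: with case-insensitive matching, abbreviations such as IN, OR and ME collide with ordinary English words, so you may prefer to match abbreviations case-sensitively or only directly before a ZIP code.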