Python string regex union returns a bunch of empty strings


I'm trying to pass a concatenated list of strings as the regular expression to re.findall:

re.findall(regex, string)

But all I get back is a list of two tuples, each filled with empty strings:

re.findall("|".join(locations), 'Zika Outbreak Hits Miami'.lower())
# [('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')]

Where locations is a list like this:

['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba', ...]

A manual test works like this:

print(re.findall('miami|zika', 'Zika Outbreak Hits Miami'.lower()))
# ['zika', 'miami']

But I don't know what's wrong with concatenating locations to create one big regex. Could the size be the problem? locations holds 24588 elements.

I'm currently creating the locations list from what geonamescache offers as cities and countries:

import geonamescache

gc = geonamescache.GeonamesCache()
countries = [country["name"].lower() for country in gc.get_countries().values()]
cities = [city["name"].lower() for city in gc.get_cities().values()]
locations = countries + cities

The text which I'm working with looks like this:

Zika Outbreak Hits Miami
Could Zika Reach New York City?
First Case of Zika in Miami Beach
Mystery Virus Spreads in Recife, Brazil
Dallas man comes down with case of Zika

Solution

  • Check your locations list for empty strings or anomalous location names.

    For example, this works well:

    In [1]: locations = ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba']
    
    In [2]: import re
    
    In [3]: re.findall("|".join(locations), 'Zika Outbreak Hits Miami'.lower())
    Out[3]: []
    
    In [4]: re.findall("|".join(locations), 'switzerland has lot of mountains'.lower())
    Out[4]: ['switzerland']
    

    And this doesn't, because there is an empty string at the end of the list. An empty alternative matches at every remaining position in the text:

    In [5]: locations = ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba', '']
    
    In [6]: re.findall("|".join(locations), 'switzerland has lot of mountains'.lower())
    Out[6]:
    ['switzerland',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '',
     '']
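
One way to rule out this failure mode is to strip whitespace and drop empty entries before joining. This is a minimal sketch: `clean_locations` is a hypothetical helper name, and the short list only stands in for the real data.

```python
import re

def clean_locations(names):
    """Strip surrounding whitespace and drop entries that end up empty."""
    stripped = (name.strip() for name in names)
    return [name for name in stripped if name]

# Small stand-in for the real list; note the trailing space and the empty entry.
raw = ['switzerland', 'bonaire, saint eustatius and saba ', 'cuba', '']
locations = clean_locations(raw)

matches = re.findall("|".join(locations), 'switzerland has lot of mountains')
print(matches)  # ['switzerland']
```

This only removes empty alternatives; it does not deal with regex metacharacters inside the names.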
    

    EDIT

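    As expected, special characters in locations are causing the problem. It's mostly place names containing parentheses that interfere with the regular expression: each unescaped pair of parentheses becomes a capturing group, and when a pattern contains groups, re.findall returns the group contents (as tuples) rather than the matched text. You can list the offending entries like this: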

    In [21]: [l for l in locations if l.find('(') >= 0]
    Out[21]:
    ['zürich (kreis 11) / seebach',
     'zürich (kreis 11) / oerlikon',
     'zürich (kreis 10) / höngg',
     'zürich (kreis 4) / aussersihl',
     'zürich (kreis 10) / wipkingen',
     'zürich (kreis 11) / affoltern',
     'zürich (kreis 2) / wollishofen',
     'zürich (kreis 3) / sihlfeld',
     'zürich (kreis 6) / unterstrass',
     'zürich (kreis 9) / albisrieden',
     'zürich (kreis 9) / altstetten',
     'stadt winterthur (kreis 1)',
     'zürich (kreis 12)',
     'seen (kreis 3)',
     'zürich (kreis 3)',
     'zürich (kreis 11)',
     'zürich (kreis 9)',
     'oberwinterthur (kreis 2)',
     'zürich (kreis 10)',
     'zürich (kreis 2)',
     'zürich (kreis 8)',
     'zürich (kreis 7)',
     'zürich (kreis 6)',
     'wetter (ruhr)',
     'schwedt (oder)',
     'kempten (allgäu)',
     'kelkheim (taunus)',
     'halle (saale)',
     'frankfurt (oder)',
     'brake (unterweser)',
     'v.s.k.valasai (dindigul-dist.)',
     'dainava (kaunas)',
     'miguel alemán (la doce)',
     'jardines de la silla (jardines)',
     'licenciado benito juárez (campo gobierno)',
     'ampliación san mateo (colonia solidaridad)',
     'kalibo (poblacion)',
     'city of milford (balance)',
     'butte-silver bow (balance)']
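
The effect can be reproduced on a tiny list. The sketch below borrows three names from the lists above; the variable names are only illustrative. Unescaped, the two parenthesized parts act as capturing groups, so re.findall reports one tuple of group contents per match instead of the matched text.

```python
import re

locations = ['switzerland', 'halle (saale)', 'frankfurt (oder)']
text = 'switzerland has lot of mountains'

# Unescaped: "(saale)" and "(oder)" are capturing groups. The match itself
# is 'switzerland', but findall returns the (empty) groups for it.
unescaped = re.findall("|".join(locations), text)
print(unescaped)  # [('', '')]

# Escaped: the parentheses are literal characters, so no groups exist
# and findall returns the matched text.
escaped = re.findall("|".join(re.escape(name) for name in locations), text)
print(escaped)  # ['switzerland']
```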
    

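    Create the regex using re.escape to take care of the special characters. Sorting the names longest first lets a longer name win over a shorter name that is its prefix. You may also want to match complete words only; otherwise partial words such as brea inside break will match.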

    In [22]: locations_regex = re.compile(r'|'.join([re.escape(l) for l in sorted(locations, key=lambda x:-len(x))]))
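
For the complete word match, one possible approach is to wrap the escaped alternatives in lookarounds. This is a sketch under assumptions: `find_locations` is a hypothetical helper, and the short list stands in for the full geonamescache-derived one.

```python
import re

# Stand-in for the full list of lowercased country and city names.
locations = ['miami', 'miami beach', 'new york city', 'recife', 'brazil',
             'dallas', 'brea', 'halle (saale)']

# Longest names first, so 'miami beach' is tried before 'miami'.
alternatives = "|".join(
    re.escape(name) for name in sorted(locations, key=len, reverse=True)
)

# (?<!\w) and (?!\w) reject matches that touch a word character on either
# side; (?:...) groups the alternatives without creating a capturing group.
locations_regex = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")

def find_locations(text):
    """Return every known location found as a whole word in the text."""
    return locations_regex.findall(text.lower())

headlines = [
    'Zika Outbreak Hits Miami',
    'Could Zika Reach New York City?',
    'First Case of Zika in Miami Beach',
    'Mystery Virus Spreads in Recife, Brazil',
    'Dallas man comes down with case of Zika',
]
for headline in headlines:
    print(headline, '->', find_locations(headline))
```

Lookarounds are used here instead of \b because some names end in a parenthesis: \b needs a word character on one side, so it would fail between ')' and a following space or period.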