I'm trying to pass a list of strings, joined with "|", as the regular expression to re.findall:
re.findall(regex, string)
But the result is just a pair of tuples filled with empty strings:
re.findall("|".join(locations), 'Zika Outbreak Hits Miami'.lower())
# [('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')]
Where locations is a list like this:
['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba', ...]
A manual test works like this:
print(re.findall('miami|zika', 'Zika Outbreak Hits Miami'.lower()))
# ['zika', 'miami']
But I don't know what's wrong with joining locations into one big regex. Could its size be the problem? locations holds 24588 elements.
I'm currently creating the locations list from what geonamescache offers as cities and countries:
import geonamescache
gc = geonamescache.GeonamesCache()
countries = [country["name"].lower() for country in list(gc.get_countries().values())]
cities = [city["name"].lower() for city in list(gc.get_cities().values())]
locations = countries + cities
The text which I'm working with looks like this:
Zika Outbreak Hits Miami
Could Zika Reach New York City?
First Case of Zika in Miami Beach
Mystery Virus Spreads in Recife, Brazil
Dallas man comes down with case of Zika
Take a look at your locations list and look for empty strings or anomalous location names in the list.
For example, this works well:
In [1]: locations = ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba']
In [2]: import re
In [3]: re.findall("|".join(locations), 'Zika Outbreak Hits Miami'.lower())
Out[3]: []
In [4]: re.findall("|".join(locations), 'switzerland has lot of mountains'.lower())
Out[4]: ['switzerland']
And this doesn't, because there is an empty string at the end of the list:
In [5]: locations = ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba', '']
In [6]: re.findall("|".join(locations), 'switzerland has lot of mountains'.lower())
Out[6]:
['switzerland',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'']
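The trailing empty string turns the pattern into one that ends in "|", and an empty alternative matches the empty string at every position where no name matches. Here is a minimal sketch of that behaviour and of one way to filter such entries out. The shortened locations list and the names with_empty, cleaned and without_empty are only illustrative, not part of the original code:

```python
import re

# Illustrative list; the trailing '' mimics the anomalous entry shown above.
locations = ['switzerland', 'cuba', '']

text = 'switzerland has lot of mountains'

# The empty alternative matches the empty string wherever no name matches.
with_empty = re.findall('|'.join(locations), text)

# Drop entries that are empty or whitespace-only before joining.
cleaned = [loc for loc in locations if loc.strip()]
without_empty = re.findall('|'.join(cleaned), text)

print(with_empty[0], len(with_empty))
print(without_empty)
```

Filtering before the join keeps the pattern free of empty alternatives, so only real names can match.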
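As expected, the special characters in locations are causing the problem. The parentheses in some names are parsed as capture groups, so findall returns tuples of (empty) group contents instead of the matched text. You can use the following code to find the offending names; it's mostly places like these that interfere with the regular expression: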
In [21]: [l for l in locations if l.find('(') >= 0]
Out[21]:
['zürich (kreis 11) / seebach',
'zürich (kreis 11) / oerlikon',
'zürich (kreis 10) / höngg',
'zürich (kreis 4) / aussersihl',
'zürich (kreis 10) / wipkingen',
'zürich (kreis 11) / affoltern',
'zürich (kreis 2) / wollishofen',
'zürich (kreis 3) / sihlfeld',
'zürich (kreis 6) / unterstrass',
'zürich (kreis 9) / albisrieden',
'zürich (kreis 9) / altstetten',
'stadt winterthur (kreis 1)',
'zürich (kreis 12)',
'seen (kreis 3)',
'zürich (kreis 3)',
'zürich (kreis 11)',
'zürich (kreis 9)',
'oberwinterthur (kreis 2)',
'zürich (kreis 10)',
'zürich (kreis 2)',
'zürich (kreis 8)',
'zürich (kreis 7)',
'zürich (kreis 6)',
'wetter (ruhr)',
'schwedt (oder)',
'kempten (allgäu)',
'kelkheim (taunus)',
'halle (saale)',
'frankfurt (oder)',
'brake (unterweser)',
'v.s.k.valasai (dindigul-dist.)',
'dainava (kaunas)',
'miguel alemán (la doce)',
'jardines de la silla (jardines)',
'licenciado benito juárez (campo gobierno)',
'ampliación san mateo (colonia solidaridad)',
'kalibo (poblacion)',
'city of milford (balance)',
'butte-silver bow (balance)']
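To see why names like these produce tuples of empty strings, here is a small sketch using two entries from the list above plus 'miami'. The three-element names list is made up for the illustration; the real list behaves the same way, just with more groups:

```python
import re

text = 'zika outbreak hits miami'
# Two names taken from the list above, plus 'miami' for the test headline.
names = ['halle (saale)', 'frankfurt (oder)', 'miami']

# Unescaped: '(saale)' and '(oder)' become capture groups, so findall
# returns one tuple of group contents per match instead of the match text.
raw = re.findall('|'.join(names), text)

# Escaped: the parentheses are matched literally and there are no groups.
escaped = re.findall('|'.join(re.escape(n) for n in names), text)

print(raw)      # [('', '')]
print(escaped)  # ['miami']
```

The match on 'miami' does not involve either group, so both group slots come back as empty strings. With 39 such names in the full list, each match becomes a tuple of 39 empty strings, which is the output shown in the question.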
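Create the regex using re.escape to take care of the special characters. Sorting the names longest first makes the regex prefer the longer name when one name is a prefix of another. You may also want to require a whole-word match; otherwise partial words, like brea from break, will match.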
In [21]: locations_regex = re.compile('|'.join(re.escape(l) for l in sorted(locations, key=len, reverse=True)))
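Putting the pieces together, one possible way to get whole-word matches is sketched below. The helper name build_location_regex, the short locations list and the third headline are assumptions made for the example. I use lookarounds instead of \b because \b would not match after a name that ends in ")":

```python
import re

def build_location_regex(names):
    """Compile one alternation that matches any name as a whole word."""
    # Strip stray whitespace, drop empty entries, and deduplicate.
    cleaned = {name.strip() for name in names if name.strip()}
    # Longest first, so 'miami beach' is tried before 'miami'.
    ordered = sorted(cleaned, key=len, reverse=True)
    alternation = '|'.join(re.escape(name) for name in ordered)
    # Lookarounds rather than \b, so names ending in ')' still match.
    return re.compile(r'(?<!\w)(?:' + alternation + r')(?!\w)')

# Illustrative subset; the real list comes from geonamescache.
locations = ['miami', 'miami beach', 'brea', 'halle (saale)', 'brazil', '']
locations_regex = build_location_regex(locations)

headlines = [
    'First Case of Zika in Miami Beach',
    'Mystery Virus Spreads in Recife, Brazil',
    'Dallas man gets a break',  # made-up headline to show the 'brea' case
]
for headline in headlines:
    print(locations_regex.findall(headline.lower()))
```

The non-capturing group (?:...) keeps findall returning plain strings, and the last headline yields no match because brea is followed by a word character.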