I have a pandas dataframe where one column is a bunch of strings with certain travel details. My goal is to parse each string to extract the city of origin and destination city (I would like to ultimately have two new columns titled 'origin' and 'destination').
The data:
df_col = [
'new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags'
]
This should result in:
Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)
Thus far I have tried:
A variety of NLTK methods, but what has gotten me closest is using the nltk.pos_tag method to tag each word in the string. The result is a list of tuples with each word and its associated tag. Here's an example:
[('Fly', 'NNP'), ('to', 'TO'), ('Australia', 'NNP'), ('&', 'CC'), ('New', 'NNP'), ('Zealand', 'NNP'), ('from', 'IN'), ('Paris', 'NNP'), ('from', 'IN'), ('€422', 'NNP'), ('return', 'NN'), ('including', 'VBG'), ('2', 'CD'), ('checked', 'VBD'), ('bags', 'NNS'), ('!', '.')]
I am stuck at this stage and am unsure how to best implement this. Can anyone point me in the right direction, please? Thanks.
Pretty much impossible at first glance, unless you have access to some API that contains pretty sophisticated components.
At first glance, it seems like you're asking to solve a natural language problem by magic. But let's break it down and scope it to the point where something is buildable.
First, to identify countries and cities, you need data that enumerates them, so let's try: https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json
At the top of the search results, we find https://datahub.io/core/world-cities, which leads to the world-cities.json file. Now we load it into sets of countries and cities.
import requests
import json
cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))
countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])
Let's put them together with flashtext's KeywordProcessor.
import requests
import json
from flashtext import KeywordProcessor
cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))
countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
keyword_processor.extract_keywords(texts[0])
[out]:
['York', 'Venice', 'Italy']
Doing due diligence, the first hunch is that "new york" is not in the data:
>>> "New York" in cities
False
What the?! #$%^&* For sanity's sake, we check these:
>>> len(countries)
244
>>> len(cities)
21940
Yes, you cannot just trust a single data source, so let's try to fetch all the data sources.
From https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json, you find another link: https://github.com/dr5hn/countries-states-cities-database. Let's munge this...
import requests
import json
cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities1_json = json.loads(requests.get(cities_url).content.decode('utf8'))
countries1 = set([city['country'] for city in cities1_json])
cities1 = set([city['name'] for city in cities1_json])
dr5hn_cities_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/cities.json"
dr5hn_countries_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/countries.json"
cities2_json = json.loads(requests.get(dr5hn_cities_url).content.decode('utf8'))
countries2_json = json.loads(requests.get(dr5hn_countries_url).content.decode('utf8'))
countries2 = set([c['name'] for c in countries2_json])
cities2 = set([c['name'] for c in cities2_json])
countries = countries2.union(countries1)
cities = cities2.union(cities1)
>>> len(countries)
282
>>> len(cities)
127793
Wow, that's a lot more cities than before.
Let's try the flashtext code again.
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
keyword_processor.extract_keywords(texts[0])
[out]:
['York', 'Venice', 'Italy']
Okay, for more sanity checks, let's just look for "york" in the list of cities.
>>> [c for c in cities if 'york' in c.lower()]
['Yorklyn',
'West York',
'West New York',
'Yorktown Heights',
'East Riding of Yorkshire',
'Yorke Peninsula',
'Yorke Hill',
'Yorktown',
'Jefferson Valley-Yorktown',
'New York Mills',
'City of York',
'Yorkville',
'Yorkton',
'New York County',
'East York',
'East New York',
'York Castle',
'York County',
'Yorketown',
'New York City',
'York Beach',
'Yorkshire',
'North Yorkshire',
'Yorkeys Knob',
'York',
'York Town',
'York Harbor',
'North York']
You: What kind of prank is this?!
Linguist: Welcome to the world of natural language processing, where natural language is a social construct subject to communal and idiolectal variation.
You: Cut the crap, tell me how to solve this.
NLP Practitioner (a real one who works on noisy user-generated texts): You just have to add to the list. But before that, check your metric given the list you already have.
from itertools import zip_longest
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris'))]
# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0
for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)

    # We're making some assumptions here that the order of
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)

    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()
Actually, it doesn't look that bad. We recover 90% of the labeled terms (strictly speaking, this ratio is recall rather than accuracy):
>>> true_positives / total_truth
0.9
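The comparison above depends on the extracted terms and the labels lining up position by position. As a rough sketch of an order-independent alternative, the helper below computes precision and recall from multiset overlap. `precision_recall` is a hypothetical helper, and the extraction lists are hard-coded on the assumption that they match what flashtext returns for these four texts at this stage, so nothing here needs flashtext installed.

```python
from collections import Counter

def precision_recall(pairs):
    """Compute micro-averaged precision and recall, ignoring term order.

    `pairs` is an iterable of (extracted, expected) sequences of strings.
    """
    true_pos = n_extracted = n_expected = 0
    for extracted, expected in pairs:
        # Multiset intersection: a term counts at most as often as it is expected.
        overlap = Counter(extracted) & Counter(expected)
        true_pos += sum(overlap.values())
        n_extracted += len(extracted)
        n_expected += len(expected)
    precision = true_pos / n_extracted if n_extracted else 0.0
    recall = true_pos / n_expected if n_expected else 0.0
    return precision, recall

# Assumed extractions (before "New York" is added), paired with the labels.
pairs = [
    (['York', 'Venice', 'Italy'], ('New York', 'Venice', 'Italy')),
    (['Brussels', 'Bangkok'], ('Brussels', 'Bangkok')),
    (['Los Angeles', 'Guadalajara', 'Mexico'], ('Los Angeles', 'Guadalajara')),
    (['Australia', 'New Zealand', 'Paris'], ('Australia', 'New Zealand', 'Paris')),
]
precision, recall = precision_recall(pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting precision next to recall makes the extra 'Mexico' visible as a false positive, which the single ratio hides.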
Alright, alright, so look at the "only" error that the above approach is making: it's simply that "New York" isn't in the list of cities.
You: Why don't we just add "New York" to the list of cities, i.e.
keyword_processor.add_keyword('New York')
print(texts[0])
print(keyword_processor.extract_keywords(texts[0]))
[out]:
['New York', 'Venice', 'Italy']
You: See, I did it!!! Now I deserve a beer.
Linguist: How about 'I live in Marawi'?
>>> keyword_processor.extract_keywords('I live in Marawi')
[]
NLP Practitioner (chiming in): How about 'I live in Jeju'?
>>> keyword_processor.extract_keywords('I live in Jeju')
[]
A Raymond Hettinger fan (from far away): "There must be a better way!"
Yes, there is. What if we just try something silly, like taking the cities whose names end with "City" and adding those names, minus the suffix, to our keyword_processor?
for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
            print(c[:-5])
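The same trick can be written as a small function that is easy to test in isolation. This is only a sketch: `strip_city_suffix` is a hypothetical helper, and `sample_cities` is a tiny made-up sample standing in for the real `cities` set.

```python
def strip_city_suffix(names, suffix=" City"):
    """Return short forms such as 'Jeju' for 'Jeju City' that are not already known."""
    short_forms = set()
    for name in names:
        if name.endswith(suffix):
            short = name[: -len(suffix)].strip()
            # Skip empty leftovers and short forms that are already in the data.
            if short and short not in names:
                short_forms.add(short)
    return short_forms

# Illustrative sample only, not the real data.
sample_cities = {"New York City", "Jeju City", "Marawi City", "Quebec City", "Quebec", "City"}
print(sorted(strip_city_suffix(sample_cities)))
```

The returned set could then be passed to `keyword_processor.add_keywords_from_list(sorted(...))` in the same way as the countries and cities.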
Now let's retry our regression test examples:
from itertools import zip_longest
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris')),
('I live in Florida', ('Florida',)),
('I live in Marawi', ('Marawi',)),
('I live in jeju', ('Jeju',))]
# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0
for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)

    # We're making some assumptions here that the order of
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)

    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()
[out]:
new york to venice, italy for usd271
['New York', 'Venice', 'Italy']
('New York', 'Venice', 'Italy')
return flights from brussels to bangkok with etihad from €407
['Brussels', 'Bangkok']
('Brussels', 'Bangkok')
from los angeles to guadalajara, mexico for usd191
['Los Angeles', 'Guadalajara', 'Mexico']
('Los Angeles', 'Guadalajara')
fly to australia new zealand from paris from €422 return including 2 checked bags
['Australia', 'New Zealand', 'Paris']
('Australia', 'New Zealand', 'Paris')
I live in Florida
['Florida']
('Florida',)
I live in Marawi
['Marawi']
('Marawi',)
I live in jeju
['Jeju']
('Jeju',)
But seriously, this is only the tip of the problem. What happens if you have a sentence like this:
>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
['Adam', 'Bangkok', 'Singapore', 'China']
WHY is Adam extracted as a city?!
Then you do some more neurotic checks:
>>> 'Adam' in cities
True
Congratulations, you've jumped into another NLP rabbit hole: polysemy, where the same word has different meanings. In this case, Adam most probably refers to a person in the sentence, but it is also coincidentally the name of a city (according to the data you've pulled).
[in]:
['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags'
]
[out]:
Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)
Linguist: Even with the assumption that the preposition (e.g. from, to) preceding the city gives you the "origin" / "destination" tag, how are you going to handle the case of "multi-leg" flights, e.g.
>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
What's the desired output of this sentence:
> Adam flew to Bangkok from Singapore and then to China
Perhaps like this? What is the specification? How (un-)structured is your input text?
> Origin: Singapore
> Destination: Bangkok
> Destination: China
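If the specification turns out to be "an ordered list of stops", one possible representation is a list of legs, each with its own origin and destination. This is just one reading of the sentence, sketched with hypothetical names (`Leg`, `stops_to_legs`); working out the stop order from free text is the hard part and is not attempted here.

```python
from typing import List, NamedTuple

class Leg(NamedTuple):
    origin: str
    destination: str

def stops_to_legs(stops: List[str]) -> List[Leg]:
    """Pair consecutive stops into legs: [A, B, C] -> [A->B, B->C]."""
    return [Leg(a, b) for a, b in zip(stops, stops[1:])]

# Assumed travel order for the example sentence: Singapore, then Bangkok, then China.
print(stops_to_legs(["Singapore", "Bangkok", "China"]))
```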
Let's take that assumption you have and try some hacks with the same flashtext methods.
What if we add to and from to the list?
from itertools import zip_longest
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
keyword_processor.add_keyword('to')
keyword_processor.add_keyword('from')
texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
for text in texts:
    extracted = keyword_processor.extract_keywords(text)
    print(text)
    print(extracted)
    print()
[out]:
new york to venice, italy for usd271
['New York', 'to', 'Venice', 'Italy']
return flights from brussels to bangkok with etihad from €407
['from', 'Brussels', 'to', 'Bangkok', 'from']
from los angeles to guadalajara, mexico for usd191
['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']
fly to australia new zealand from paris from €422 return including 2 checked bags
['to', 'Australia', 'New Zealand', 'from', 'Paris', 'from']
Okay, let's work with the above output and see what we can do about the first problem: a from that is followed by a price rather than a place. Maybe check whether the term after the from is a city or country and, if not, remove the from?
from itertools import zip_longest
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
keyword_processor.add_keyword('to')
keyword_processor.add_keyword('from')
texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
for text in texts:
    extracted = keyword_processor.extract_keywords(text)
    print(text)

    new_extracted = []
    extracted_next = extracted[1:]
    for e_i, e_iplus1 in zip_longest(extracted, extracted_next):
        # Drop a 'from' that is not followed by a known city or country.
        # A trailing 'from' (e_iplus1 is None) is also caught by this branch.
        if e_i == 'from' and e_iplus1 not in cities and e_iplus1 not in countries:
            print(e_i, e_iplus1)
            continue
        elif e_i == 'from' and e_iplus1 is None: # last word in the list.
            continue
        else:
            new_extracted.append(e_i)

    print(new_extracted)
    print()
That seems to do the trick and removes the from that doesn't precede a city/country.
[out]:
new york to venice, italy for usd271
['New York', 'to', 'Venice', 'Italy']
return flights from brussels to bangkok with etihad from €407
from None
['from', 'Brussels', 'to', 'Bangkok']
from los angeles to guadalajara, mexico for usd191
['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']
fly to australia new zealand from paris from €422 return including 2 checked bags
from None
['to', 'Australia', 'New Zealand', 'from', 'Paris']
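From cleaned lists like these, assigning origin and destination is a short loop over the terms. A minimal sketch, assuming that places after from are origins, places after to are destinations, and a leading place with no preposition is the origin. `assign_roles` is a hypothetical helper, and the input lists are copied from the output above so the snippet runs without flashtext.

```python
def assign_roles(extracted):
    """Split a cleaned keyword list into origin and destination places."""
    roles = {"origin": [], "destination": []}
    current = "origin"  # assumption: a leading place with no preposition is the origin
    for term in extracted:
        if term == "from":
            current = "origin"
        elif term == "to":
            current = "destination"
        else:
            roles[current].append(term)
    return roles

cleaned = [
    ['New York', 'to', 'Venice', 'Italy'],
    ['from', 'Brussels', 'to', 'Bangkok'],
    ['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico'],
    ['to', 'Australia', 'New Zealand', 'from', 'Paris'],
]
for terms in cleaned:
    print(assign_roles(terms))
```

Each value is a list rather than a single string so that "Venice, Italy" and "Australia / New Zealand" both fit. Filling in a missing country (USA, Thailand, France) would need a separate city-to-country lookup, and that lookup hits the same ambiguity problems as before.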
Linguist: Think carefully, should ambiguity be resolved by making an informed decision to make the ambiguous phrase obvious? If so, what is the "information" in the informed decision? Should it follow a certain template first to detect the information before filling in the ambiguity?
You: I'm losing my patience with you... You're taking me round in circles. Where's that AI that can understand human language that I keep hearing about in the news and from Google and Facebook and all?!
You: What you gave me is rule-based, so where's the AI in all this?
NLP Practitioner: Didn't you want 100%? Writing "business logic" or rule-based systems would be the only way to really achieve that "100%" on a specific data set when there is no preset data set that one can use for "training an AI".
You: What do you mean by training an AI? Why can't I just use Google or Facebook or Amazon or Microsoft or even IBM's AI?
NLP Practitioner: Let me introduce you to
Welcome to the world of Computational Linguistics and NLP!
Yes, there's no real ready-made magical solution, and if you want to use an "AI" or machine learning algorithm, you would most probably need a lot more training data like the texts_labels pairs shown in the above example.
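To give a feel for what such training data tends to look like, here is a rough sketch that turns one (text, label) pair into token-level tags using the BIO convention that many sequence taggers expect (B-LOC starts a place name, I-LOC continues it, O is everything else). `to_bio_tags` is a hypothetical helper with deliberately naive whitespace tokenization. It is not tied to any particular library.

```python
def to_bio_tags(text, places):
    """Label each whitespace token as B-LOC, I-LOC or O given known place names."""
    tokens = text.split()
    # Compare on lowercased tokens with surrounding punctuation stripped.
    normalized = [tok.strip(",.!?").lower() for tok in tokens]
    tags = ["O"] * len(tokens)
    for place in places:
        words = place.lower().split()
        for i in range(len(tokens) - len(words) + 1):
            if normalized[i:i + len(words)] == words:
                tags[i] = "B-LOC"
                for j in range(i + 1, i + len(words)):
                    tags[j] = "I-LOC"
    return list(zip(tokens, tags))

text, label = ('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy'))
for token, tag in to_bio_tags(text, label):
    print(token, tag)
```

A real tagger would need thousands of sentences labeled this way, plus separate labels for origin and destination if you want the model to learn that distinction too.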