Tags: python, regex, pandas, nlp, nltk

Parsing city of origin / destination city from a string


I have a pandas dataframe where one column is a bunch of strings with certain travel details. My goal is to parse each string to extract the city of origin and destination city (I would like to ultimately have two new columns titled 'origin' and 'destination').

The data:

df_col = [
    'new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags'
]

This should result in:

Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)

Thus far I have tried: A variety of NLTK methods, but what has gotten me closest is using the nltk.pos_tag method to tag each word in the string. The result is a list of tuples with each word and associated tag. Here's an example...

[('Fly', 'NNP'), ('to', 'TO'), ('Australia', 'NNP'), ('&', 'CC'), ('New', 'NNP'), ('Zealand', 'NNP'), ('from', 'IN'), ('Paris', 'NNP'), ('from', 'IN'), ('€422', 'NNP'), ('return', 'NN'), ('including', 'VBG'), ('2', 'CD'), ('checked', 'VBD'), ('bags', 'NNS'), ('!', '.')]

I am stuck at this stage and am unsure how to best implement this. Can anyone point me in the right direction, please? Thanks.


Solution

    TL;DR

    Pretty much impossible at first glance, unless you have access to some API that contains pretty sophisticated components.

    In Long

    At first look, it seems like you're asking to solve a natural language problem magically. But let's break it down and scope it to a point where something is buildable.

    First, to identify countries and cities, you need data that enumerates them, so let's try: https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json

    At the top of the search results, we find https://datahub.io/core/world-cities, which leads to the world-cities.json file. Now we load them into sets of countries and cities.

    import requests
    import json
    
    cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
    cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))
    
    countries = set([city['country'] for city in cities_json])
    cities = set([city['name'] for city in cities_json])
    

    Now, given the data, let's try to build component ONE:

    • Task: Detect if any substring in the texts matches a city/country.
    • Tool: https://github.com/vi3k6i5/flashtext (a fast string search/match)
    • Metric: No. of correctly identified cities/countries in string
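
For intuition about what such a matcher does, here is a toy stand-in written with the standard library only. It is a minimal sketch, not how flashtext is implemented: the function name `extract_places` and the mini-gazetteer are made up for illustration. It does a greedy longest-match lookup of word n-grams against a set of names, which is enough to see how a missing multi-word entry degrades to a shorter match.

```python
import re

def extract_places(text, gazetteer):
    """Greedy longest-match lookup of word n-grams against a set of names."""
    # Map lowercased names back to their canonical spelling.
    canonical = {name.lower(): name for name in gazetteer}
    max_words = max((len(name.split()) for name in canonical), default=0)
    words = re.findall(r"\w+", text.lower())

    found, i = [], 0
    while i < len(words):
        # Try the longest candidate starting at word i first, then shrink.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in canonical:
                found.append(canonical[candidate])
                i += n
                break
        else:
            i += 1  # no known name starts at this word
    return found

# Hypothetical mini-gazetteer, NOT the real data set.
places = {"York", "Venice", "Italy", "Australia", "New Zealand", "Paris"}
print(extract_places("new york to venice, italy for usd271", places))
```

With "York" in the set but "New York" absent, the longest match available at that position is just "York". Adding "New York" to the set makes the two-word match win.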

    Let's put them together.

    import requests
    import json
    from flashtext import KeywordProcessor
    
    cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
    cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))
    
    countries = set([city['country'] for city in cities_json])
    cities = set([city['name'] for city in cities_json])
    
    
    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))
    
    
    texts = ['new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags']
    keyword_processor.extract_keywords(texts[0])
    

    [out]:

    ['York', 'Venice', 'Italy']
    

    Hey, what went wrong?!

    Doing due diligence, first hunch is that "new york" is not in the data,

    >>> "New York" in cities
    False
    

    What the?! #$%^&* For sanity's sake, we check these:

    >>> len(countries)
    244
    >>> len(cities)
    21940
    

    Yes, you cannot just trust a single data source, so let's try to fetch more data sources.

    From https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json, you find another link: https://github.com/dr5hn/countries-states-cities-database. Let's munge this...

    import requests
    import json
    
    cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
    cities1_json = json.loads(requests.get(cities_url).content.decode('utf8'))
    
    countries1 = set([city['country'] for city in cities1_json])
    cities1 = set([city['name'] for city in cities1_json])
    
    dr5hn_cities_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/cities.json"
    dr5hn_countries_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/countries.json"
    
    cities2_json = json.loads(requests.get(dr5hn_cities_url).content.decode('utf8'))
    countries2_json = json.loads(requests.get(dr5hn_countries_url).content.decode('utf8'))
    
    countries2 = set([c['name'] for c in countries2_json])
    cities2 = set([c['name'] for c in cities2_json])
    
    countries = countries2.union(countries1)
    cities = cities2.union(cities1)
    

    And now that we are neurotic, we do sanity checks.

    >>> len(countries)
    282
    >>> len(cities)
    127793
    

    Wow, that's a lot more cities than previously.

    Let's try the flashtext code again.

    from flashtext import KeywordProcessor
    
    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))
    
    texts = ['new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags']
    
    keyword_processor.extract_keywords(texts[0])
    

    [out]:

    ['York', 'Venice', 'Italy']
    

    Seriously?! There is no New York?! $%^&*

    Okay, for more sanity checks, let's just look for "york" in the list of cities.

    >>> [c for c in cities if 'york' in c.lower()]
    ['Yorklyn',
     'West York',
     'West New York',
     'Yorktown Heights',
     'East Riding of Yorkshire',
     'Yorke Peninsula',
     'Yorke Hill',
     'Yorktown',
     'Jefferson Valley-Yorktown',
     'New York Mills',
     'City of York',
     'Yorkville',
     'Yorkton',
     'New York County',
     'East York',
     'East New York',
     'York Castle',
     'York County',
     'Yorketown',
     'New York City',
     'York Beach',
     'Yorkshire',
     'North Yorkshire',
     'Yorkeys Knob',
     'York',
     'York Town',
     'York Harbor',
     'North York']
    

    Eureka! It's because it's called "New York City" and not "New York"!

    You: What kind of prank is this?!

    Linguist: Welcome to the world of natural language processing, where natural language is a social construct subject to communal and idiolectal variation.

    You: Cut the crap, tell me how to solve this.

    NLP Practitioner (a real one who works on noisy user-generated texts): You just have to add to the list. But before that, check your metric given the list you already have.

    For every text in your sample "test set", you should provide some truth labels to make sure you can "measure your metric".

    from itertools import zip_longest
    from flashtext import KeywordProcessor
    
    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))
    
    texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
    ('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
    ('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
    ('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris'))]
    
    # No. of correctly extracted terms.
    true_positives = 0
    false_positives = 0
    total_truth = 0
    
    for text, label in texts_labels:
        extracted = keyword_processor.extract_keywords(text)
    
        # We're making some assumptions here that the order of 
        # extracted and the truth must be the same.
        true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
        false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
        total_truth += len(label)
    
        # Just visualization candies.
        print(text)
        print(extracted)
        print(label)
        print()
    

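    Actually, it doesn't look that bad. We get an accuracy of 90% (strictly speaking, this ratio is recall: correctly extracted terms over the number of truth labels):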

    >>> true_positives / total_truth
    0.9
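
The loop above compares extracted terms and labels position by position, so one extra or missing term shifts everything after it. As a hedged alternative, the sketch below ignores order and reports precision and recall separately. The function name `precision_recall` is made up, and the `samples` list is hand-written: the first entry copies the output shown earlier, the other three are assumed for illustration rather than taken from a real run.

```python
from collections import Counter

def precision_recall(pairs):
    """pairs: iterable of (extracted_list, truth_tuple); order is ignored."""
    tp = fp = fn = 0
    for extracted, truth in pairs:
        e, t = Counter(extracted), Counter(truth)
        overlap = sum((e & t).values())   # terms present in both
        tp += overlap
        fp += sum(e.values()) - overlap   # extracted but not in the truth
        fn += sum(t.values()) - overlap   # in the truth but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Assumed extractor outputs paired with the truth labels used above.
samples = [
    (['York', 'Venice', 'Italy'], ('New York', 'Venice', 'Italy')),
    (['Brussels', 'Bangkok'], ('Brussels', 'Bangkok')),
    (['Los Angeles', 'Guadalajara', 'Mexico'], ('Los Angeles', 'Guadalajara')),
    (['Australia', 'New Zealand', 'Paris'], ('Australia', 'New Zealand', 'Paris')),
]
precision, recall = precision_recall(samples)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under these assumed outputs, recall agrees with the 0.9 above, while precision is lower because "York" and the unlabeled "Mexico" count as false positives.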
    

    But I %^&*(-ing want 100% extraction!!

    Alright, alright. Look at the "only" error that the above approach is making: it's simply that "New York" isn't in the list of cities.

    You: Why don't we just add "New York" to the list of cities, i.e.

    keyword_processor.add_keyword('New York')
    
    print(texts[0])
    print(keyword_processor.extract_keywords(texts[0]))
    

    [out]:

    ['New York', 'Venice', 'Italy']
    

    You: See, I did it!!! Now I deserve a beer.

    Linguist: How about 'I live in Marawi'?

    >>> keyword_processor.extract_keywords('I live in Marawi')
    []
    

    NLP Practitioner (chiming in): How about 'I live in Jeju'?

    >>> keyword_processor.extract_keywords('I live in Jeju')
    []
    

    A Raymond Hettinger fan (from far away): "There must be a better way!"

    Yes, there is. What if we just try something silly, like taking every city name that ends with "City" and adding it, minus the "City" suffix, to our keyword_processor?

    for c in cities:
        if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
            if c[:-5].strip():
                keyword_processor.add_keyword(c[:-5])
                print(c[:-5])
    

    It works!

    Now let's retry our regression test examples:

    from itertools import zip_longest
    from flashtext import KeywordProcessor
    
    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))
    
    for c in cities:
        if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
            if c[:-5].strip():
                keyword_processor.add_keyword(c[:-5])
    
    texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
    ('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
    ('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
    ('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris')),
    # A one-element tuple needs a trailing comma; ('Florida') is just a string.
    ('I live in Florida', ('Florida',)),
    ('I live in Marawi', ('Marawi',)),
    ('I live in jeju', ('Jeju',))]
    
    # No. of correctly extracted terms.
    true_positives = 0
    false_positives = 0
    total_truth = 0
    
    for text, label in texts_labels:
        extracted = keyword_processor.extract_keywords(text)
    
        # We're making some assumptions here that the order of 
        # extracted and the truth must be the same.
        true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
        false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
        total_truth += len(label)
    
        # Just visualization candies.
        print(text)
        print(extracted)
        print(label)
        print()
    

    [out]:

    new york to venice, italy for usd271
    ['New York', 'Venice', 'Italy']
    ('New York', 'Venice', 'Italy')
    
    return flights from brussels to bangkok with etihad from €407
    ['Brussels', 'Bangkok']
    ('Brussels', 'Bangkok')
    
    from los angeles to guadalajara, mexico for usd191
    ['Los Angeles', 'Guadalajara', 'Mexico']
    ('Los Angeles', 'Guadalajara')
    
    fly to australia new zealand from paris from €422 return including 2 checked bags
    ['Australia', 'New Zealand', 'Paris']
    ('Australia', 'New Zealand', 'Paris')
    
    I live in Florida
    ['Florida']
    ('Florida',)
    
    I live in Marawi
    ['Marawi']
    ('Marawi',)
    
    I live in jeju
    ['Jeju']
    ('Jeju',)
    

    100% Yeah, NLP-bunga !!!

    But seriously, this is only the tip of the problem. What happens if you have a sentence like this:

    >>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
    ['Adam', 'Bangkok', 'Singapore', 'China']
    

    WHY is Adam extracted as a city?!

    Then you do some more neurotic checks:

    >>> 'Adam' in cities
    True
    

    Congratulations, you've jumped into another NLP rabbit hole: polysemy, where the same word has different meanings. In this case, "Adam" most probably refers to a person in the sentence, but it is also coincidentally the name of a city (according to the data you've pulled).
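
One cheap, rule-based way to soften this is sketched below. It is only a heuristic under stated assumptions: the `AMBIGUOUS` and `LOCATION_CUES` sets and the function name `filter_ambiguous` are hypothetical, and a real list of ambiguous names would have to come from inspecting your own data. Names on the ambiguous list are kept only when a location cue word such as "to" or "from" directly precedes them; all other names pass through unchanged.

```python
import re

# Hypothetical, hand-maintained names that are also first names or common words.
AMBIGUOUS = {"adam", "florence", "nice", "reading"}
# Hypothetical cue words that suggest the next word is a place.
LOCATION_CUES = {"to", "from", "in", "via"}

def filter_ambiguous(text, extracted):
    """Drop ambiguous names unless a location cue word directly precedes them."""
    words = re.findall(r"\w+", text.lower())
    kept = []
    for name in extracted:
        if name.lower() not in AMBIGUOUS:
            kept.append(name)  # unambiguous names are always kept
            continue
        first = name.lower().split()[0]
        # Every position where the first word of the name occurs in the text.
        positions = [i for i, w in enumerate(words) if w == first]
        if any(i > 0 and words[i - 1] in LOCATION_CUES for i in positions):
            kept.append(name)
    return kept

sentence = 'Adam flew to Bangkok from Singapore and then to China'
print(filter_ambiguous(sentence, ['Adam', 'Bangkok', 'Singapore', 'China']))
```

This trades one kind of error for another: a sentence that starts with an ambiguous city name and no cue word would now lose that city.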

    I see what you did there... Even if we ignore this polysemy nonsense, you are still not giving me the desired output:

    [in]:

    ['new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags'
    ]
    

    [out]:

    Origin: New York, USA; Destination: Venice, Italy
    Origin: Brussels, BEL; Destination: Bangkok, Thailand
    Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
    Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)
    

    Linguist: Even with the assumption that the preposition (e.g. from, to) preceding the city gives you the "origin" / "destination" tag, how are you going to handle the case of "multi-leg" flights, e.g.

    >>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
    

    What's the desired output of this sentence:

    > Adam flew to Bangkok from Singapore and then to China
    

    Perhaps like this? What is the specification? How (un-)structured is your input text?

    > Origin: Singapore
    > Destination: Bangkok
    > Destination: China
    

    Let's try to build component TWO to detect prepositions.

    Let's take that assumption you have and try some hacks with the same flashtext methods.

    What if we add "to" and "from" to the list?

    from itertools import zip_longest
    from flashtext import KeywordProcessor
    
    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))
    
    for c in cities:
        if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
            if c[:-5].strip():
                keyword_processor.add_keyword(c[:-5])
    
    keyword_processor.add_keyword('to')
    keyword_processor.add_keyword('from')
    
    texts = ['new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags']
    
    
    for text in texts:
        extracted = keyword_processor.extract_keywords(text)
        print(text)
        print(extracted)
        print()
    

    [out]:

    new york to venice, italy for usd271
    ['New York', 'to', 'Venice', 'Italy']
    
    return flights from brussels to bangkok with etihad from €407
    ['from', 'Brussels', 'to', 'Bangkok', 'from']
    
    from los angeles to guadalajara, mexico for usd191
    ['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']
    
    fly to australia new zealand from paris from €422 return including 2 checked bags
    ['to', 'Australia', 'New Zealand', 'from', 'Paris', 'from']
    

    Heh, using to/from is a pretty crappy rule:

    1. What if the "from" is referring to the price of the ticket?
    2. What if there's no "to/from" preceding the country/city?

    Okay, let's work with the above output and see what we can do about problem 1. Maybe check whether the term after the "from" is a city or country and, if not, remove the "from"?

    from itertools import zip_longest
    from flashtext import KeywordProcessor
    
    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))
    
    for c in cities:
        if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
            if c[:-5].strip():
                keyword_processor.add_keyword(c[:-5])
    
    keyword_processor.add_keyword('to')
    keyword_processor.add_keyword('from')
    
    texts = ['new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags']
    
    
    for text in texts:
        extracted = keyword_processor.extract_keywords(text)
        print(text)
    
        new_extracted = []
        extracted_next = extracted[1:]
        for e_i, e_iplus1 in zip_longest(extracted, extracted_next):
            if e_i == 'from' and e_iplus1 not in cities and e_iplus1 not in countries:
                print(e_i, e_iplus1)
                continue
            elif e_i == 'from' and e_iplus1 is None:  # last word in the list.
                continue
            else:
                new_extracted.append(e_i)
    
        print(new_extracted)
        print()
    

    That seems to do the trick and removes the "from" that doesn't precede a city/country.

    [out]:

    new york to venice, italy for usd271
    ['New York', 'to', 'Venice', 'Italy']
    
    return flights from brussels to bangkok with etihad from €407
    from None
    ['from', 'Brussels', 'to', 'Bangkok']
    
    from los angeles to guadalajara, mexico for usd191
    ['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']
    
    fly to australia new zealand from paris from €422 return including 2 checked bags
    from None
    ['to', 'Australia', 'New Zealand', 'from', 'Paris']
    

    But the "from New York" still isn't solved!!

    Linguist: Think carefully, should ambiguity be resolved by making an informed decision to make the ambiguous phrase obvious? If so, what is the "information" in the informed decision? Should it follow a certain template first to detect the information before filling in the ambiguity?
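
One possible template, sketched under loud assumptions: a place after "from" is an origin, a place after "to" is a destination, and an unmarked place that appears before any preposition is treated as the origin when the text contains a "to" (as in "new york to venice"). The function name `assign_roles` is made up, and the rule is a guess at a specification, not a general solution. It consumes the filtered lists printed above.

```python
def assign_roles(tokens):
    """Turn e.g. ['from', 'Brussels', 'to', 'Bangkok'] into origin/destination lists."""
    roles = {"origin": [], "destination": []}
    has_to = "to" in tokens
    current = None  # role selected by the most recent preposition
    for tok in tokens:
        if tok == "from":
            current = "origin"
        elif tok == "to":
            current = "destination"
        elif current is None:
            # Assumption: an unmarked leading place is the origin if a "to"
            # follows somewhere; with no "to" at all, call it a destination.
            roles["origin" if has_to else "destination"].append(tok)
        else:
            roles[current].append(tok)
    return roles

for tokens in (['New York', 'to', 'Venice', 'Italy'],
               ['from', 'Brussels', 'to', 'Bangkok'],
               ['to', 'Australia', 'New Zealand', 'from', 'Paris']):
    print(assign_roles(tokens))
```

A country that follows a city ("Venice", "Italy") simply lands in the same list, and a multi-leg sentence ends up with several destinations. On a pandas column, a function like this could be applied row by row with `Series.apply` after the extraction step, though that wiring is not shown here.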

    You: I'm losing my patience with you... You're taking me around in circles. Where's that AI that can understand human language that I keep hearing about from the news and Google and Facebook and all?!

    You: What you gave me is rule-based. Where's the AI in all this?

    NLP Practitioner: Didn't you want 100%? Writing "business logic" or rule-based systems would be the only way to really achieve that "100%" on a specific data set when there is no existing data set that one can use for "training an AI".

    You: What do you mean by training an AI? Why can't I just use Google or Facebook or Amazon or Microsoft or even IBM's AI?

    NLP Practitioner: Let me introduce you to

    Welcome to the world of Computational Linguistics and NLP!

    In Short

    Yes, there's no real ready-made magical solution and if you want to use an "AI" or machine learning algorithm, most probably you would need a lot more training data like the texts_labels pairs shown in the above example.
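
To make "training data like the texts_labels pairs" slightly more concrete, here is a minimal sketch that turns one such pair into token-level tags. The BIO scheme, the `LOC` tag name and the function name `bio_tags` are assumptions chosen for illustration; which sequence-labelling model would consume the tags is left open.

```python
import re

def bio_tags(text, labels):
    """Tag each token as B-LOC / I-LOC / O given the place names in `labels`."""
    tokens = re.findall(r"\w+|[^\w\s]", text)  # words, plus punctuation as single tokens
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    for label in labels:
        words = label.lower().split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            # Tag the first still-untagged occurrence of this label.
            if lowered[i:i + n] == words and all(t == "O" for t in tags[i:i + n]):
                tags[i] = "B-LOC"
                tags[i + 1:i + n] = ["I-LOC"] * (n - 1)
                break
    return list(zip(tokens, tags))

pair = ('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy'))
for token, tag in bio_tags(*pair):
    print(f"{token}\t{tag}")
```

A handful of such examples is nowhere near enough to train anything useful; the point is only to show the shape of the data a learned extractor would need.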