python nlp data-science data-extraction

Map city names to countries - python?


I have a dataframe that represents the location of some people.

This dataframe has not been cleaned and the names are a mess. Some rows have only the country name, others have the country name and the city, and others have only the city. I also have sentences that are not in English.

How can I use Python with NLP to tidy this dataset and get a homogeneous dataset?

Here is a screenshot of the dataset: (image not available)

Thanks in advance


Solution

  • I'm unable to comment, so I'll ask here: it isn't clear what exactly you want to extract from this series. If you just want to find every named location and build a new Series from them, you are probably looking for Named Entity Recognition (NER). NLTK is a good place to start with NER, and it has a pretty good tutorial on how to use it to get specific types of named entities (see Section 5, Named Entity Recognition).

    In short, I would start with something like:

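    import nltk
    # Needs NLTK's tokenizer, POS tagger, and NE chunker data (install once via nltk.download()).
    ser = df["location"]  # <your Series of strings>; "location" is a placeholder column name
    # Tokenize, POS-tag, then chunk each string into a tree of named entities.
    locations = ser.apply(lambda x: nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(str(x)))))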
    

    But NLP is a complicated task, and as has been discussed, NER is particularly difficult.
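
If the goal is just one country per row, a plain lookup may already cover many of the rows before any NER is needed. Below is a minimal sketch of that idea using only the standard library. The names `COUNTRIES`, `CITY_TO_COUNTRY`, `normalize`, and `guess_country` are made up for this example, and the two tables are tiny hand-written placeholders; in practice they would have to be loaded from a proper gazetteer.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lookup tables (assumption: a real project
# would load these from a gazetteer instead of hard-coding them).
COUNTRIES = {
    "france": "France",
    "germany": "Germany",
    "deutschland": "Germany",
    "spain": "Spain",
    "espana": "Spain",
}
CITY_TO_COUNTRY = {
    "paris": "France",
    "lyon": "France",
    "berlin": "Germany",
    "munich": "Germany",
    "madrid": "Spain",
}


def normalize(text):
    """Strip accents, lowercase, and replace anything that is not a-z with a space."""
    decomposed = unicodedata.normalize("NFKD", str(text))
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z\s]", " ", no_accents.lower())


def guess_country(text):
    """Return a country for a messy location string, or None if nothing matches."""
    tokens = normalize(text).split()
    # An explicit country name wins over a city match.
    for token in tokens:
        if token in COUNTRIES:
            return COUNTRIES[token]
    # Otherwise fall back to the first known city.
    for token in tokens:
        if token in CITY_TO_COUNTRY:
            return CITY_TO_COUNTRY[token]
    return None
```

With pandas this could be applied as `ser.apply(guess_country)`. The sketch only matches single-word names, so multi-word cities and misspellings fall through to `None`; those leftover rows are where NER or fuzzy matching would still be needed.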