Tags: python, nlp

NLP -- correctly tokenizing multi-word terms like 'new york' or 'hip hop'


I'm working on an NLP project using the Amazon Digital Music reviews as the dataset. I'm preprocessing all the reviews by lemmatizing, stemming, tokenizing, and removing punctuation and stopwords...

However, I'm stuck on a problem. Is there a way to preprocess the text by telling Python:

`if there are words like 'new york', 'los angeles' or 'hip hop', then do not split them but melt them: 'new_york', 'los_angeles', 'hip_hop'`?

I do not want to map all of them manually, and I have tried playing with bigrams and with POS tagging, but with no success.

Can you help me?


Solution

  • Assuming you have a finite list of words you'd like to 'melt', you could use str.replace() on the text:

    text = 'new york and applesauce and hip hop'
    replacement_dict = {'new york': 'new_york', 'hip hop': 'hip_hop'}

    # Replace each multi-word phrase with its underscore-joined form
    for phrase, merged in replacement_dict.items():
        text = text.replace(phrase, merged)

    print(text)
    # new_york and applesauce and hip_hop
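
    One caveat: str.replace() also matches inside longer tokens, so 'new york' inside 'new yorker' would become 'new_yorker'. If that matters for your reviews, a word-boundary regex is safer -- a minimal sketch using Python's built-in re module:

    import re

    text = 'the new yorker reviews new york hip hop'
    replacement_dict = {'new york': 'new_york', 'hip hop': 'hip_hop'}

    for phrase, merged in replacement_dict.items():
        # \b word boundaries stop 'new yorker' from being rewritten
        text = re.sub(r'\b' + re.escape(phrase) + r'\b', merged, text)

    print(text)
    # the new yorker reviews new_york hip_hop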
    

    Since you said you don't want to map these manually, you'll need a way to identify commonly occurring bigrams, which are called 'collocations'. There's no single, definitive way to do this, but there are plenty of resources for building collocation identifiers, two of which are linked below, followed by a short sketch of one automated approach.

    https://www.geeksforgeeks.org/nlp-word-collocations/

    https://medium.com/@nicharuch/collocations-identifying-phrases-that-act-like-individual-words-in-nlp-f58a93a2f84a
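
    One concrete option, beyond the linked articles: gensim's Phrases model learns frequent bigrams from a tokenized corpus and joins them with an underscore, which is essentially the 'melting' you describe. A minimal sketch, assuming gensim is installed and using a toy corpus in place of the real reviews:

    from gensim.models.phrases import Phrases

    # Toy corpus: one token list per review. With the real data this
    # would be the tokenized Amazon reviews.
    sentences = [
        ['i', 'love', 'hip', 'hop'],
        ['hip', 'hop', 'from', 'new', 'york'],
        ['new', 'york', 'rap', 'and', 'hip', 'hop'],
    ]

    # Learn which adjacent token pairs co-occur often enough to count
    # as phrases; min_count and threshold need raising on real data.
    bigram = Phrases(sentences, min_count=1, threshold=1)

    for sent in sentences:
        print(bigram[sent])
    # e.g. ['i', 'love', 'hip_hop']
    #      ['hip_hop', 'from', 'new_york']
    #      ['new_york', 'rap', 'and', 'hip_hop']

    Applying a second Phrases pass on the merged output can pick up trigrams such as 'new_york_city', if they occur frequently enough.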