python nlp tokenize

How to tokenize groups of words in Python


I am developing an application in Python that gives job recommendations based on an uploaded resume. I want to tokenize the resume before processing it further, but I need to keep groups of words together. For example, "Data Science" is a keyword, but when I tokenize the text I get "data" and "science" separately. How can I overcome this? Is there a Python library that does this kind of extraction?


Solution

  • It sounds like you want to generate n-grams (bigrams in particular), which keep adjacent words together so that "data science" appears as a single unit. If that's the case, here is one way to do it with NLTK:

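    from nltk import ngrams

    resume = '... working in the data science field for years ...'
    n = 2  # bigrams: pairs of adjacent words
    # ngrams() yields tuples of n consecutive tokens from the word list
    bigrams = ngrams(resume.split(), n)
    for grams in bigrams:
        print(grams)  # e.g. ('data', 'science')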
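
N-grams return every adjacent pair of words, not only the meaningful ones. If you already have a list of keywords you care about, another option is to merge just those phrases back into single tokens. The following is a minimal pure-Python sketch of that idea, not part of the original answer; `merge_phrases` and `KEYWORDS` are hypothetical names introduced for illustration, and the keyword list is assumed to be supplied by you.

```python
def merge_phrases(tokens, phrases):
    """Greedily merge known multi-word phrases into single tokens."""
    # Represent each phrase as a tuple of lowercase words, longest first,
    # so longer phrases win over shorter ones that share a prefix.
    phrase_tuples = sorted(
        (tuple(p.lower().split()) for p in phrases if p.split()),
        key=len,
        reverse=True,
    )
    merged = []
    i = 0
    while i < len(tokens):
        for phrase in phrase_tuples:
            n = len(phrase)
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window == phrase:
                # Keep the original casing, joined with a space.
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts at this position: keep the single word.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical keyword list; in practice this would come from your own data.
KEYWORDS = ["data science", "machine learning"]

resume = "working in the data science field for years"
tokens = merge_phrases(resume.split(), KEYWORDS)
print(tokens)
```

Matching is case-insensitive and relies on plain whitespace splitting, so punctuation attached to a word (such as "science,") would prevent a match unless the text is cleaned first. NLTK also provides `MWETokenizer` in `nltk.tokenize` for the same purpose, if you prefer a library implementation.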