Tags: python, nlp, n-gram

How to split a text into N-grams and get their offsets


I want to split a text into N-grams but also get each N-gram's offset in the text.
I am currently using the NLTK library in Python, but I didn't find any native way to get the offsets of the N-grams back.
I did find this answer, but I was wondering whether there is a library that offers this without my having to implement it. My issue is that the same N-gram occurs multiple times in the text I want to split, so I cannot simply search the text for each N-gram's string afterwards.

The example usage would be:

    ngrams_with_offset("I like apples and I like oranges", 2)
    >>> [("I", "like", offset=0),
         ("like", "apples", offset=2),
         .......
         ("I", "like", offset=18),
         ..... ]
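
To make the duplicate problem concrete: a plain substring search over the raw text (str.find is used here purely for illustration) only ever reports the first match, so the second ("I", "like") bigram could never be assigned offset 18 that way:

    text = "I like apples and I like oranges"
    # find() returns the position of the first occurrence only,
    # so both ("I", "like") bigrams would end up with offset 0.
    text.find("I like")
    >>> 0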


Solution

  • I did not find any native way to do this, so I implemented my own to fit my use case, using the align_tokens() function from NLTK.
    It looks something like this:

    from nltk import word_tokenize, ngrams
    from nltk.corpus import stopwords
    from nltk.tokenize.util import align_tokens

    stop_words = set(stopwords.words("english"))
    tokenized_text = [word for word in word_tokenize(text) if word.lower() not in stop_words]
    alignment = align_tokens(tokenized_text, text)  # (start, end) character span of each token
    tokenized_with_offset = list(zip(tokenized_text, alignment))
    ngrams(tokenized_with_offset, n)  # n-grams of (token, (start, end)) pairs
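
    Wrapped into the ngrams_with_offset() helper named in the question, a minimal sketch (without the stopword filtering, so the output matches the example above) could look like this; note that align_tokens() gives (start, end) character spans rather than a single offset:

    from nltk import word_tokenize, ngrams
    from nltk.tokenize.util import align_tokens

    def ngrams_with_offset(text, n):
        # Align each token back to its (start, end) span in the raw text,
        # then build n-grams over (token, span) pairs.
        tokens = word_tokenize(text)
        spans = align_tokens(tokens, text)
        return list(ngrams(zip(tokens, spans), n))

    for gram in ngrams_with_offset("I like apples and I like oranges", 2):
        print(gram)
    # (('I', (0, 1)), ('like', (2, 6)))
    # (('like', (2, 6)), ('apples', (7, 13)))
    # ...
    # (('I', (18, 19)), ('like', (20, 24)))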