I want to split a text into n-grams and also get each n-gram's offset in the original text.
I am currently using the NLTK library in Python, but I couldn't find any native way to recover the offsets of the n-grams.
I did find this answer, but I was wondering whether any library offers this out of the box, without my having to implement it. My issue is that the same n-gram occurs multiple times in the text I want to split, so I can't simply search for each n-gram afterwards to recover its position.
The example usage would be:
ngrams_with_offset("I like apples and I like oranges", 2)
>>> [("I", "like", offset=0),
     ("like", "apples", offset=2),
     ...,
     ("I", "like", offset=18),
     ...]
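To make the desired behaviour concrete, here is a minimal plain-Python sketch (no NLTK involved; the function name `ngrams_with_offset` comes from my example above, and the `\S+` token pattern is just an illustrative simplification of real tokenization):

```python
import re
from typing import List, Tuple

def ngrams_with_offset(text: str, n: int) -> List[Tuple]:
    # Find word tokens together with their start offsets.
    # \S+ is a crude stand-in for a real tokenizer.
    tokens = [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]
    # Slide a window of size n over the (token, offset) pairs and
    # report the offset of the first token of each n-gram.
    return [
        tuple(tok for tok, _ in tokens[i : i + n]) + (tokens[i][1],)
        for i in range(len(tokens) - n + 1)
    ]

ngrams_with_offset("I like apples and I like oranges", 2)
# → [('I', 'like', 0), ('like', 'apples', 2), ..., ('I', 'like', 18), ...]
```

Because the offsets come from `re.finditer` match positions rather than from searching for the n-gram text, repeated n-grams like ("I", "like") get their correct, distinct offsets.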
Since I did not find any native way to do this, I implemented my own to fit my use case, using NLTK's align_tokens() function.
It looks roughly like this:
from nltk import ngrams, word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize.util import align_tokens

stop_words = set(stopwords.words("english"))

# Tokenize and drop stopwords.
tokenized_text = [word for word in word_tokenize(text) if word.lower() not in stop_words]

# align_tokens() returns the (start, end) span of each token in the original text.
alignment = align_tokens(tokenized_text, text)

# Pair every token with its span, then build the n-grams over those pairs.
tokenized_with_offset = list(zip(tokenized_text, alignment))
list(ngrams(tokenized_with_offset, n))