I'm trying to build n-grams that don't cross a period symbol. split() only works on strings, and list[index] only works with an index. Is there a way to access/split/divide a list by giving it a string/an element? Here is a snippet of my current function:
text = ["split","this","stuff",".","my","dear"]
def generate_ngram(rawlist, ngram_order):
    """
    Input: List of words or characters, ngram-order ["this", "is", "an", "example"], 2
    Output: Set of tuples of words or characters {("this", "is"),("is","an"),...}
    """
    list_of_tuples = []
    for i in range(0, len(rawlist) - ngram_order + 1):
        ngram_order_index = i + ngram_order
        generated_ngram = rawlist[i : ngram_order_index]
        #if "." in generated_ngram:
            #generated_ngram . . .
        generated_tuple = tuple(generated_ngram)
        list_of_tuples.append(generated_tuple)
    return set(list_of_tuples)
generate_ngram(text,3)
currently returns:
{('.', 'my', 'dear'),
('stuff', '.', 'my'),
('split', 'this', 'stuff'),
('this', 'stuff', '.')}
but it should ideally return:
{('split', 'this', 'stuff'),
('this', 'stuff', '.')}
Any idea on how to achieve this? Thanks for your help!
I'm not sure if this is exactly what you need, but this function generates n-grams that can contain stop words (in this case, the period) only in the last position:
STOPWORDS = {"."}
def generate_ngram(rawlist, ngram_order):
    # All ngrams
    ngrams = zip(*(rawlist[i:] for i in range(ngram_order)))
    # Generate only those ngrams that do not contain stop words before the end
    return (ngram for ngram in ngrams if not any(w in STOPWORDS for w in ngram[:-1]))
text = ["split", "this", "stuff", ".", "my", "dear"]
print(*generate_ngram(text, 3), sep="\n")
# ('split', 'this', 'stuff')
# ('this', 'stuff', '.')
print(*generate_ngram(text, 2), sep="\n")
# ('split', 'this')
# ('this', 'stuff')
# ('stuff', '.')
# ('my', 'dear')
Note that this function returns a generator. You can convert it to a list by wrapping it in list(...) if you want, or you can iterate over it directly.
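As a small illustration of the generator behaviour, the sketch below repeats the function from above and wraps the result in set(...) to get the same return type as the function in the question. It also shows that a generator can be consumed only once. The variable names as_set, gen, first and second are made up for this example.

```python
STOPWORDS = {"."}

def generate_ngram(rawlist, ngram_order):
    # Same function as above: all ngrams, filtered lazily
    ngrams = zip(*(rawlist[i:] for i in range(ngram_order)))
    return (ngram for ngram in ngrams if not any(w in STOPWORDS for w in ngram[:-1]))

text = ["split", "this", "stuff", ".", "my", "dear"]

# Wrap in set(...) to get a set of tuples, as in the question
as_set = set(generate_ngram(text, 3))

# A generator is exhausted after one pass
gen = generate_ngram(text, 3)
first = list(gen)   # contains the two valid ngrams
second = list(gen)  # empty, because gen has already been consumed
```

If you need to go over the n-grams more than once, store them in a list or set first.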
EDIT: You may find the equivalent syntax below more readable.
def generate_ngram(rawlist, ngram_order):
    # Iterate over all ngrams
    for ngram in zip(*(rawlist[i:] for i in range(ngram_order))):
        # Yield only those not containing stop words before the end
        if not any(w in STOPWORDS for w in ngram[:-1]):
            yield ngram
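On the literal question of splitting a list at a given element: as far as I know, lists have no built-in equivalent of str.split, but a short loop can do it. The following is only a sketch of that alternative approach, not a replacement for the functions above. The names split_after and ngrams_per_sentence are hypothetical helpers invented here, and the sketch assumes the period should stay attached to the end of its sentence, as in the desired output.

```python
def split_after(items, separator="."):
    """Yield sublists of items, each ending right after a separator."""
    chunk = []
    for item in items:
        chunk.append(item)
        if item == separator:
            yield chunk
            chunk = []
    # Trailing items that are not followed by a separator
    if chunk:
        yield chunk

def ngrams_per_sentence(rawlist, ngram_order, separator="."):
    """Build ngrams inside each sentence so none crosses a separator."""
    result = set()
    for sentence in split_after(rawlist, separator):
        for i in range(len(sentence) - ngram_order + 1):
            result.add(tuple(sentence[i:i + ngram_order]))
    return result

text = ["split", "this", "stuff", ".", "my", "dear"]
sentences = list(split_after(text))
trigrams = ngrams_per_sentence(text, 3)
bigrams = ngrams_per_sentence(text, 2)
```

With the sample text, sentences holds two sublists, and trigrams contains only the two tuples from the first sentence, since the second sentence is shorter than three items.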