python nlp nltk n-gram language-model

How to create windows/chunks for a list of sentences?


I have a list of sentences and I want to create skipgrams (window size = 3), but I don't want the window to span across sentences, since the sentences are all unrelated.

So, if I have the sentences:

[["my name is John"] , ["This PC is black"]]

the triplets will be:

[my name is]
[name is John]
[This PC is]
[PC is black]

What is the best way to do it?


Solution

  • Generate the n-grams one sentence at a time, so that no window crosses a sentence boundary:

    from nltk import ngrams

    def generate_ngrams(sentences, window_size=3):
        # Process each sentence separately so no n-gram spans two sentences.
        for sentence in sentences:
            # Each sentence is a one-element list holding the sentence string.
            yield from ngrams(sentence[0].split(), window_size)

    sentences = [["my name is John"], ["This PC is black"]]

    for c in generate_ngrams(sentences, 3):
        print(c)

    # Output:
    # ('my', 'name', 'is')
    # ('name', 'is', 'John')
    # ('This', 'PC', 'is')
    # ('PC', 'is', 'black')
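
If NLTK is not available, the same per-sentence windowing can be sketched with the standard library only. This is a minimal sketch, not part of the original answer: the function name `sentence_ngrams` is hypothetical, and it assumes the same input shape as the question (each sentence is a one-element list holding a whitespace-separated string).

```python
def sentence_ngrams(sentences, window_size=3):
    """Yield n-gram tuples sentence by sentence (hypothetical helper)."""
    for sentence in sentences:
        # Assumption: each sentence is a one-element list holding a string.
        tokens = sentence[0].split()
        # Zipping the token list against shifted copies of itself produces
        # sliding windows; zip stops at the shortest copy, so a window
        # never runs past the end of the current sentence.
        yield from zip(*(tokens[i:] for i in range(window_size)))


sentences = [["my name is John"], ["This PC is black"]]
triplets = list(sentence_ngrams(sentences))
for triplet in triplets:
    print(triplet)
```

A sentence with fewer tokens than `window_size` produces no windows here, so short sentences are silently skipped rather than padded.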