
Python, NLP: How to find all trigrams from text files with adjectives as the middle term


I think the question is self-explanatory, but here are the details.

I want to use the nltk library to extract, from text files, all trigrams whose middle term is an adjective.

Example Text - A red ball was with the good boy.

Example of output -

('A','red','ball'), ('the','good','boy')  

and so on


Solution

  • This code should do it:

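    import nltk
    from nltk.tokenize import word_tokenize

    # One-time model downloads. Recent NLTK releases may instead ask for
    # 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    # Penn Treebank tags for base, comparative, and superlative adjectives
    ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}

    tokens = word_tokenize("He is a very handsome man. Her children are funny. She has a lovely voice")
    tagged = nltk.pos_tag(tokens)

    results = []
    for i, (word, tag) in enumerate(tagged):
        # An adjective needs a neighbour on both sides to form a trigram
        if tag in ADJECTIVE_TAGS and 0 < i < len(tagged) - 1:
            results.append((tagged[i-1][0], word, tagged[i+1][0]))

    print(results)
    # output: [('very', 'handsome', 'man'), ('are', 'funny', '.'), ('a', 'lovely', 'voice')]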
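
The output above includes ('are', 'funny', '.'), so a punctuation token can end up as a neighbour. If that is not wanted, a minimal sketch of one way to filter it is shown here. The helper name `adjective_trigrams` is hypothetical, and the hand-tagged `example` list is only a stand-in for what `nltk.pos_tag` would return for the question's sentence, so the snippet runs without any downloads.

```python
from typing import List, Tuple

ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}  # Penn Treebank adjective tags


def adjective_trigrams(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (previous, adjective, next) word triples from one tagged sentence.

    `tagged` is a list of (word, tag) pairs, the shape nltk.pos_tag returns.
    Triples whose outer words contain no letter or digit (i.e. punctuation)
    are skipped. This helper is illustrative, not part of NLTK.
    """
    results = []
    # Zipping three staggered views yields every window of three consecutive tokens.
    for (prev_word, _), (word, tag), (next_word, _) in zip(tagged, tagged[1:], tagged[2:]):
        if tag not in ADJECTIVE_TAGS:
            continue
        outer_words_ok = all(
            any(ch.isalnum() for ch in w) for w in (prev_word, next_word)
        )
        if outer_words_ok:
            results.append((prev_word, word, next_word))
    return results


# Hand-tagged stand-in (an assumption) for nltk.pos_tag output on the example text.
example = [
    ("A", "DT"), ("red", "JJ"), ("ball", "NN"), ("was", "VBD"), ("with", "IN"),
    ("the", "DT"), ("good", "JJ"), ("boy", "NN"), (".", "."),
]
print(adjective_trigrams(example))
# output: [('A', 'red', 'ball'), ('the', 'good', 'boy')]
```

For real text files, one option is to read each file's contents, split them with `nltk.sent_tokenize`, and pass `nltk.pos_tag(word_tokenize(sentence))` to the helper one sentence at a time, so that no trigram spans a sentence boundary.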