
How can I remove POS tags before slashes in nltk?


This is part of a project in which I need to represent the output of phrase detection as (a, x, b), where a, x, and b are phrases. My code currently produces output like this:

(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))

I want to convert it to the representation described above, which means I have to remove the 'CLAUSE', 'NP', 'VP', 'VBD', 'NNP', etc. tags.

How can I do that?

What I tried

First I wrote this output to a text file, tokenized it, and used list.remove('word'), but that did not help at all. To clarify a bit more:

My Input

(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP)) (CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))

Output will be

[Jack,loved,Peter], [Jack,stayed,in London]

The output simply follows the bracket grouping, without the tags.


Solution

  • Since you tagged this nltk, let's use NLTK's tree parser to process your trees. We'll read in each tree, then simply print out the leaves. Done.

    >>> import nltk
    >>> text = "(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))"
    >>> tree = nltk.Tree.fromstring(text, read_leaf=lambda x: x.split("/")[0])
    >>> print(tree.leaves())
    ['Jack', 'stayed', 'in', 'London']
    

    The lambda passed as read_leaf is called on every leaf: it splits each word/tag pair on the slash and discards the tag, keeping just the word.
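
    One caveat, offered as a sketch and not as part of the original answer: split("/")[0] cuts at the first slash, so a token whose word part itself contains a slash (for example a fraction such as 1/2/CD) would be truncated. Assuming the tag is always the part after the last slash, splitting from the right is a safer variant. The helper name strip_tag below is hypothetical.

```python
def strip_tag(token: str) -> str:
    """Return the word part of a word/TAG token (hypothetical helper)."""
    # rsplit with maxsplit=1 cuts only at the LAST slash, so any slashes
    # inside the word itself are preserved. A token with no slash at all
    # is returned unchanged.
    return token.rsplit("/", 1)[0]


print(strip_tag("London/NNP"))   # London
print(strip_tag("1/2/CD"))       # 1/2
print("1/2/CD".split("/")[0])    # 1  (the first-slash split loses part of the word)
```

    If your data needs this, the same expression can be dropped into the read_leaf lambda in place of x.split("/")[0].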

    Multiple trees

    I know, you're going to ask me how to process a whole file's worth of such trees, some of which take more than one line. That's the job of NLTK's BracketParseCorpusReader, but it expects terminals to be in the form (POS word) instead of word/POS. I won't bother doing it that way, since it's even easier to trick Tree.fromstring() into reading all your trees as if they were branches of a single tree:

    import nltk

    allmytext = """
    (CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
    (CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
    (CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
    """
    wrapped = "(ROOT " + allmytext + " )"  # Add a "root" node at the top
    trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])
    for tree in trees:  # Each child of ROOT is one of the original trees
        print(tree.leaves())
    

    As you can see, the only difference is that we added "(ROOT " and " )" around the file contents and used a for-loop to generate the output. Iterating over the wrapped tree gives us the children of the top node, i.e. the actual trees.
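
    Note that leaves() returns a flat word list, while the question asks for one item per phrase ("in London" kept together). A minimal sketch of that grouping follows, assuming every child of a CLAUSE is itself a phrase subtree (NP, VP, ...) and never a bare word. The function name clause_phrases is hypothetical.

```python
import nltk

allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
"""


def clause_phrases(text):
    """Return one list of phrase strings per top-level clause (hypothetical helper)."""
    # Wrap all trees under an artificial ROOT node, as in the answer above.
    wrapped = "(ROOT " + text + " )"
    root = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])
    # Iterating over a Tree yields its children. Assumption: each child of a
    # clause is a phrase subtree, so its leaves can be joined into one string.
    return [[" ".join(phrase.leaves()) for phrase in clause] for clause in root]


for phrases in clause_phrases(allmytext):
    print(phrases)
# ['Jack', 'loved', 'Peter']
# ['Jack', 'stayed', 'in London']
```

    Joining the words of each phrase with a space is a design choice made here to match the requested output. Keeping phrase.leaves() as a nested list would work just as well if you need the individual words later.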