This is part of a project in which I need to represent the output of phrase detection as (a, x, b), where a, x, and b are phrases. My code produces output like this:
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
I want it to look like the representation above, which means I have to remove the 'CLAUSE', 'NP', 'VP', 'VBD', 'NNP', etc. tags.
How can I do that?
First I wrote the output to a text file, tokenized it, and used list.remove('word'), but that did not help at all.
To clarify a bit more, this input:
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
should produce [Jack,loved,Peter], [Jack,stayed,in London]. The output is grouped according to the brackets and contains no tags.
Since you tagged this nltk, let's use NLTK's tree parser to process your trees. We'll read in each tree, then simply print out the leaves. Done.
>>> import nltk
>>> text = "(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))"
>>> tree = nltk.Tree.fromstring(text, read_leaf=lambda x: x.split("/")[0])
>>> print(tree.leaves())
['Jack', 'stayed', 'in', 'London']
The lambda form splits each word/tag pair and discards the tag, keeping just the word.
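One caveat with split("/")[0]: it cuts at the first slash, so a token whose word itself contains a slash would be truncated. If your data might contain such tokens, a leaf reader that cuts at the last slash is a possible alternative. This is only a sketch; the name strip_tag is made up for illustration, and it assumes the tag is always whatever follows the final slash.

```python
def strip_tag(token):
    # Hypothetical helper: drop the POS tag, assumed to follow the LAST slash.
    word, sep, _tag = token.rpartition("/")
    # If there is no slash at all, return the token unchanged.
    return word if sep else token

print(strip_tag("Jack/NNP"))         # Jack
print(strip_tag("1/2/CD"))           # 1/2
print("1/2/CD".split("/")[0])        # 1  (the first-slash version loses "/2")
```

It can be passed in place of the lambda, as read_leaf=strip_tag.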
I know, you're going to ask me how to process a whole file's worth of such trees, and some of them take more than one line. That's the job of NLTK's BracketParseCorpusReader, but it expects terminals to be in the form (POS word) instead of word/POS. I won't bother doing it that way, since it's even easier to trick Tree.fromstring() into reading all your trees as if they're branches of a single tree:
import nltk

allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
"""
wrapped = "(ROOT " + allmytext + " )"  # Add a "root" node at the top
trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])
for tree in trees:  # Iterating over a Tree yields its children
    print(tree.leaves())
As you see, the only difference is we added "(ROOT " and " )" around the file contents, and used a for-loop to generate the output. The loop gives us the children of the top node, i.e. the actual trees.
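The question asks for output grouped by phrase, e.g. [Jack, stayed, in London], whereas leaves() returns one flat list of words per clause. One way to get the grouping is to join the leaves of each child phrase separately. The following is a rough sketch; the helper names strip_tag and phrases are invented here, and it assumes each clause is a flat sequence of phrases as in your sample.

```python
import nltk

allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
"""

def strip_tag(token):
    # Keep the word, drop the POS tag after the slash.
    return token.split("/")[0]

def phrases(clause):
    # Hypothetical helper: one space-joined string per child of the clause.
    result = []
    for child in clause:
        # A child is normally a subtree (NP, VP); a bare leaf is kept as is.
        words = child.leaves() if isinstance(child, nltk.Tree) else [child]
        result.append(" ".join(words))
    return result

wrapped = "(ROOT " + allmytext + " )"  # same wrapping trick as above
root = nltk.Tree.fromstring(wrapped, read_leaf=strip_tag)
grouped = [phrases(clause) for clause in root]
for row in grouped:
    print(row)
# ['Jack', 'loved', 'Peter']
# ['Jack', 'stayed', 'in London']
```

If your chunker ever produces nested phrases, each top-level child is still collapsed into a single string, which may or may not be what you want.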