While using PySpark and NLTK, I want to get the lengths of all "NP" words and sort them in descending order. I am currently stuck on navigating the subtree.
Example subtree output:
#>>>[(Tree('NP', [Tree('NBAR', [('WASHINGTON', 'NN')])]), 1)
I want to count how many NP words there are of each length and list those counts in descending order.
The first element would be length 1 together with the number of words of that length, and so on.
Example:
#[(1, 6157),  6157 words of length 1
# (2, 1833),  1833 words of length 2
# (3, 654),
# (4, 204),
# (5, 65)]
import nltk
import re
textstring = """This is just a bunch of words to use for this example.
John gave them to me last night but Kim took them to work.
Hi Stacy. URL:http://example.com. Jessica, Mark, Tiger, Book, Crow, Airplane, SpaceShip"""
TOKEN_RE = re.compile(r"\b[\w']+\b")
grammar = r"""
NBAR:
{<NN.*|JJS>*<NN.*>}
NP:
{<NBAR>}
{<NBAR><IN><NBAR>}
"""
chunker = nltk.RegexpParser(grammar)
text = sc.parallelize(textstring.split(' '))  # sc is the SparkContext provided by the pyspark shell
dropURL=text.filter(lambda x: "URL" not in x)
words = dropURL.flatMap(lambda dropURL: dropURL.split(" "))
tree = words.flatMap(lambda words: chunker.parse(nltk.tag.pos_tag(nltk.regexp_tokenize(words, TOKEN_RE))))
#data=tree.map(lambda word: (word,len(word))).filter(lambda t : t.label() =='NBAR') -- error
#data=tree.map(lambda x: (x,len(x)))##.filter(lambda t : t[0] =='NBAR')
#>>>[(Tree('NP', [Tree('NBAR', [('WASHINGTON', 'NN')])]), 1) Trying to get the length of all NP's and in descending order.
#data=tree.map(lambda x: (x,len(x))).reduceByKey(lambda x: x=='NBAR') ##this is an error but I am getting close I think
data=tree.map(lambda x: (x[0][0],len(x[0][0][0])))#.reduceByKey(lambda x : x[1] =='NP') ##Long run time.
things = data.collect()
things
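You can add a type check for each entry to prevent errors. After the flatMap, the RDD holds both Tree subtrees and plain (word, tag) tuples, and tuples have no label() method: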
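result = (tree
          # keep only Tree elements labelled 'NP'; skips plain (word, tag) tuples
          .filter(lambda t: isinstance(t, nltk.tree.Tree) and t.label() == 'NP')
          # t[0][0][0] is the first word of the first NBAR; emit (word length, 1)
          .map(lambda t: (len(t[0][0][0]), 1))
          .reduceByKey(lambda x, y: x + y)
          .sortByKey()
          )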
print(result.collect())
# [(2, 1), (3, 2), (4, 5), (5, 5), (7, 2), (8, 1), (9, 1)]
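To see why the type check matters without starting Spark, here is a minimal plain-Python sketch of the same filter, map, count and sort steps. The `Node` class is a hypothetical stand-in for `nltk.tree.Tree`, and the sample elements are made up for illustration; neither comes from the real data.

```python
from collections import Counter


class Node(list):
    """Hypothetical stand-in for nltk.tree.Tree: a list of children plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


# Made-up sample of what the flatMap can yield:
# NP subtrees mixed with bare (word, tag) tuples.
elements = [
    Node("NP", [Node("NBAR", [("WASHINGTON", "NN")])]),
    ("gave", "VBD"),
    Node("NP", [Node("NBAR", [("Kim", "NNP")])]),
    ("to", "TO"),
    Node("NP", [Node("NBAR", [("work", "NN")])]),
    Node("NP", [Node("NBAR", [("John", "NNP")])]),
    Node("NP", [Node("NBAR", [("night", "NN")])]),
]

# Same idea as the filter(): the isinstance check runs first,
# so label() is never called on a tuple.
np_nodes = [t for t in elements if isinstance(t, Node) and t.label() == "NP"]

# Same idea as map() + reduceByKey(): count words per character length.
counts = Counter(len(t[0][0][0]) for t in np_nodes)

# Like sortByKey(): ascending by word length.
by_length = sorted(counts.items())

# Descending by count, ties broken by word length.
by_count_desc = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(by_length)      # [(3, 1), (4, 2), (5, 1), (10, 1)]
print(by_count_desc)  # [(4, 2), (3, 1), (5, 1), (10, 1)]
```

In Spark itself, if the pairs should come out in descending order of count rather than ascending order of length, one option is to replace `.sortByKey()` with `.sortBy(lambda kv: kv[1], ascending=False)`.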