I am working with the Python "Pattern.en" package, which gives me the subject, object and other details of a sentence.
I want to store this output in another variable or a DataFrame for further processing, but I have not been able to do so.
Any input on this would be helpful.
Sample code is below for reference.
from pattern.en import parse
from pattern.en import pprint
import pandas as pd
input = parse('I want to go to the Restaurant as I am hungry very much')
print(input)
I/PRP/B-NP/O want/VBP/B-VP/O to/TO/I-VP/O go/VB/I-VP/O to/TO/O/O the/DT/B-NP/O Restaurant/NNP/I-NP/O as/IN/B-PP/B-PNP I/PRP/B-NP/I-PNP am/VBP/B-VP/O hungry/JJ/B-ADJP/O very/RB/I-ADJP/O much/JJ/I-ADJP/O
pprint(input)
WORD         TAG   CHUNK    ROLE   ID   PNP   LEMMA
I            PRP   NP       -      -    -     -
want         VBP   VP       -      -    -     -
to           TO    VP ^     -      -    -     -
go           VB    VP ^     -      -    -     -
to           TO    -        -      -    -     -
the          DT    NP       -      -    -     -
Restaurant   NNP   NP ^     -      -    -     -
as           IN    PP       -      -    PNP   -
I            PRP   NP       -      -    PNP   -
am           VBP   VP       -      -    -     -
hungry       JJ    ADJP     -      -    -     -
very         RB    ADJP ^   -      -    -     -
much         JJ    ADJP ^   -      -    -     -
Note the output of both the print and pprint statements. I am trying to store either one of them in a variable. Ideally I would store the output of pprint in a DataFrame, since it is already printed in tabular form.
But when I try to do so, I get the error below:
df = pd.DataFrame(input)
ValueError: DataFrame constructor not properly called!
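As far as I can tell, the error arises because parse() returns a plain tagged string (a str subclass), and the DataFrame constructor rejects a bare string. One workaround is to split that string yourself. The following is only a minimal sketch: the helper name tagged_to_rows and the column names are my own, it assumes the default four tags per token (word/POS/chunk/PNP) as in the print output above, and it does not handle words that themselves contain a slash.

```python
# The tagged string exactly as printed by print(input) in the question.
TAGGED = (
    "I/PRP/B-NP/O want/VBP/B-VP/O to/TO/I-VP/O go/VB/I-VP/O to/TO/O/O "
    "the/DT/B-NP/O Restaurant/NNP/I-NP/O as/IN/B-PP/B-PNP I/PRP/B-NP/I-PNP "
    "am/VBP/B-VP/O hungry/JJ/B-ADJP/O very/RB/I-ADJP/O much/JJ/I-ADJP/O"
)

def tagged_to_rows(tagged, columns=("word", "tag", "chunk", "pnp")):
    """Hypothetical helper: one dict per token, keyed by column name."""
    rows = []
    for token in tagged.split():        # tokens are separated by whitespace
        parts = token.split("/")        # tags within a token are separated by "/"
        rows.append(dict(zip(columns, parts)))
    return rows

rows = tagged_to_rows(TAGGED)
print(rows[0])
```

Passing the result to pandas with pd.DataFrame(rows) should then give one row per token, with the raw IOB labels (B-NP, I-VP, O, ...) rather than the prettified pprint columns.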
Starting from the source of Pattern's table function, I came up with this:
from pattern.en import parse
from pattern.text.tree import WORD, POS, CHUNK, PNP, REL, ANCHOR, LEMMA, IOB, ROLE, MBSP, Text
import pandas as pd

def sentence2df(sentence, placeholder="-"):
    tags  = [WORD, POS, IOB, CHUNK, ROLE, REL, PNP, ANCHOR, LEMMA]
    tags += [tag for tag in sentence.token if tag not in tags]
    def format(token, tag):
        # Returns the token tag as a string.
        if   tag == WORD   : s = token.string
        elif tag == POS    : s = token.type
        elif tag == IOB    : s = token.chunk and (token.index == token.chunk.start and "B" or "I")
        elif tag == CHUNK  : s = token.chunk and token.chunk.type
        elif tag == ROLE   : s = token.chunk and token.chunk.role
        elif tag == REL    : s = token.chunk and token.chunk.relation and str(token.chunk.relation)
        elif tag == PNP    : s = token.chunk and token.chunk.pnp and token.chunk.pnp.type
        elif tag == ANCHOR : s = token.chunk and token.chunk.anchor_id
        elif tag == LEMMA  : s = token.lemma
        else               : s = token.custom_tags.get(tag)
        return s or placeholder
    # One list of values per tag, in the order given by `tags`.
    columns = [[format(token, tag) for token in sentence] for tag in tags]
    # Mark tokens inside a chunk (IOB == "I") with " ^", then drop the IOB column.
    columns[3] = [columns[3][i]+(iob == "I" and " ^" or "") for i, iob in enumerate(columns[2])]
    del columns[2]
    header = ['word', 'tag', 'chunk', 'role', 'id', 'pnp', 'anchor', 'lemma']+tags[9:]
    # The anchor column is only meaningful with MBSP, so drop it otherwise.
    if not MBSP:
        del columns[6]
        del header[6]
    # Transpose the per-tag columns into per-token rows.
    return pd.DataFrame(
        [[x[i] for x in columns] for i in range(len(columns[0]))],
        columns=header,
    )
Usage
>>> string = parse('I want to go to the Restaurant as I am hungry very much')
>>> sentence = Text(string, token=[WORD, POS, CHUNK, PNP])[0]
>>> df = sentence2df(sentence)
>>> print(df)
          word  tag   chunk  role  id  pnp  lemma
0            I  PRP      NP     -   -    -      -
1         want  VBP      VP     -   -    -      -
2           to   TO    VP ^     -   -    -      -
3           go   VB    VP ^     -   -    -      -
4           to   TO       -     -   -    -      -
5          the   DT      NP     -   -    -      -
6   Restaurant  NNP    NP ^     -   -    -      -
7           as   IN      PP     -   -  PNP      -
8            I  PRP      NP     -   -  PNP      -
9           am  VBP      VP     -   -    -      -
10      hungry   JJ    ADJP     -   -    -      -
11        very   RB  ADJP ^     -   -    -      -
12        much   JJ  ADJP ^     -   -    -      -
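A note on the " ^" entries in the chunk column: they come from the columns[3] = ... line in sentence2df, which uses the older `cond and a or b` idiom to append a marker to every token that continues a chunk (IOB tag "I") rather than starts one. The snippet below is just an illustrative sketch of that one step in isolation, written with a conditional expression; the function name mark_continuations and the sample lists are hypothetical, not part of Pattern.

```python
def mark_continuations(chunks, iobs, marker=" ^"):
    """Append `marker` to each chunk label whose IOB tag is "I" (inside a chunk)."""
    return [chunk + (marker if iob == "I" else "") for chunk, iob in zip(chunks, iobs)]

# Chunk labels and IOB tags for "I want to go to" (the last token has no chunk).
chunks = ["NP", "VP", "VP", "VP", "-"]
iobs = ["B", "B", "I", "I", "-"]

marked = mark_continuations(chunks, iobs)
print(marked)  # ['NP', 'VP', 'VP ^', 'VP ^', '-']
```

The two spellings behave the same here because " ^" is a non-empty (truthy) string. If you would rather keep the chunk type and the continuation flag as separate DataFrame columns, you could skip this step and keep the IOB column instead of deleting it.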