I implemented AllenNLP's OIE, which extracts subject, predicate, and object information (in the form of ARG0, V, ARG1, etc.) embedded in nested strings. However, I need to make sure that each output stays linked to the ID of the original sentence.
I produced the following pandas dataframe, where the OIE output column contains the raw output of the AllenNLP predictor.
Current output:
sentence | ID | OIE output |
---|---|---|
'The girl went to the cinema' | 'abcd' | {'verbs':[{'verb': 'went', 'description':'[ARG0: The girl] [V: went] [ARG1:to the cinema]'}]} |
'He is right and he is an engineer' | 'efgh' | {'verbs':[{'verb': 'is', 'description':'[ARG0: He] [V: is] [ARG1:right]'}, {'verb': 'is', 'description':'[ARG0: He] [V: is] [ARG1:an engineer]'}]} |
My code to get the above table:
    oie_l = []
    for sent in sentences:
        oie_pred = predictor_oie.predict(sentence=sent)  # allennlp oie predictor
        for d in oie_pred['verbs']:  # get to the nested info
            d.pop('tags')  # remove unnecessary info
        oie_l.append(oie_pred)
    df['OIE output'] = oie_l  # add new column to df
Desired output:
sentence | ID | OIE Triples |
---|---|---|
'The girl went to the cinema' | 'abcd' | '[ARG0: The girl] [V: went] [ARG1:to the cinema]' |
'He is right and he is an engineer' | 'efgh' | '[ARG0: He] [V: is] [ARG1:right]' |
'He is right and he is an engineer' | 'efgh' | '[ARG0: He] [V: is] [ARG1:an engineer]' |
Approach idea:
To get the desired 'OIE Triples' column, I was considering converting the initial 'OIE output' into a string and then using a regular expression to extract the ARGs. However, I am not sure this is the best solution, as the ARGs can vary. Another approach would be to iterate over the nested 'description' values, replace what is currently in 'OIE output' with a list of them, and then use df.explode() to expand it, so that the right sentence and ID stay linked to each triple after exploding.
Any advice is appreciated.
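Your second idea should do the trick:

    import ast

    # Only needed if the "OIE output" values are strings rather than dicts
    df["OIE Triples"] = df["OIE output"].apply(ast.literal_eval)
    # Collect the description of every entry under "verbs" into a list
    df["OIE Triples"] = df["OIE Triples"].apply(
        lambda val: [a_dict["description"] for a_dict in val["verbs"]])
    df = df.explode("OIE Triples").drop(columns="OIE output")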
In case the "OIE output" values are not truly `dict`s but strings, we first convert them to `dict`s via `ast.literal_eval` (so if they are already `dict`s, you can skip the import and the `ast.literal_eval` line and apply the lambda to df["OIE output"] directly).
Then, for each value of the series, we build a list of the "description"s found in the entries under the outermost dict's "verbs" key.
Finally, we `explode` these description lists and `drop` the "OIE output" column, as it is no longer needed.
This gives:

                                  sentence      ID                                       OIE Triples
    0        'The girl went to the cinema'  'abcd'  [ARG0: The girl] [V: went] [ARG1:to the cinema]
    1  'He is right and he is an engineer'  'efgh'                  [ARG0: He] [V: is] [ARG1:right]
    1  'He is right and he is an engineer'  'efgh'            [ARG0: He] [V: is] [ARG1:an engineer]
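For reference, here is a self-contained sketch of the same idea that runs without AllenNLP. The sample rows are copied from the question; the second row is deliberately stored as a string to cover both cases, and the `descriptions` helper is a hypothetical name introduced only for this illustration.

```python
import ast

import pandas as pd

# Sample data from the question. The second "OIE output" value is a string
# (an assumption made here to show the literal_eval path).
df = pd.DataFrame({
    "sentence": ["The girl went to the cinema", "He is right and he is an engineer"],
    "ID": ["abcd", "efgh"],
    "OIE output": [
        {"verbs": [{"verb": "went",
                    "description": "[ARG0: The girl] [V: went] [ARG1:to the cinema]"}]},
        "{'verbs':[{'verb': 'is', 'description':'[ARG0: He] [V: is] [ARG1:right]'}, "
        "{'verb': 'is', 'description':'[ARG0: He] [V: is] [ARG1:an engineer]'}]}",
    ],
})


def descriptions(value):
    """Return the "description" strings of one OIE result (hypothetical helper)."""
    if isinstance(value, str):
        # Stored as text: parse it back into a dict first.
        value = ast.literal_eval(value)
    return [verb["description"] for verb in value["verbs"]]


result = (
    df.assign(**{"OIE Triples": df["OIE output"].apply(descriptions)})
      .explode("OIE Triples")        # one row per description
      .drop(columns="OIE output")    # raw output is no longer needed
)
print(result)
```

Because `explode` repeats the other columns for every list element, the sentence and ID stay attached to each triple. The original index is repeated as well; chaining `.reset_index(drop=True)` gives a fresh one if that is preferred.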