Get an item value from a nested dictionary inside the rows of a pandas df and get rid of the rest

I implemented allennlp's OIE, which extracts subject, predicate, object information (in the form of ARG0, V, ARG1 etc) embedded in nested strings. However, I need to make sure that each output is linked to the given ID of the original sentence.

I produced the following pandas dataframe, where OIE output contains the raw output of the allennlp algorithm.

Current output:

sentence	ID	OIE output
'The girl went to the cinema'	'abcd'	{'verbs':[{'verb': 'went', 'description':'[ARG0: The girl] [V: went] [ARG1:to the cinema]'}]}
'He is right and he is an engineer'	'efgh'	{'verbs':[{'verb': 'is', 'description':'[ARG0: He] [V: is] [ARG1:right]'}, {'verb': 'is', 'description':'[ARG0: He] [V: is] [ARG1:an engineer]'}]}

My code to get the above table:

oie_l = []

for sent in sentences:
  oie_pred = predictor_oie.predict(sentence=sent) #allennlp oie predictor
  for d in oie_pred['verbs']: #get to the nested info
    d.pop('tags') #remove unnecessary info
  oie_l.append(oie_pred)

df['OIE output'] = oie_l #add new column to df

Desired output:

sentence	ID	OIE Triples
'The girl went to the cinema'	'abcd'	'[ARG0: The girl] [V: went] [ARG1:to the cinema]'
'He is right and he is an engineer'	'efgh'	'[ARG0: He] [V: is] [ARG1:right]'
'He is right and he is an engineer'	'efgh'	'[ARG0: He] [V: is] [ARG1:an engineer]'

Approach idea:

To get the desired 'OIE Triples' output, I was considering converting the initial 'OIE output' to a string and then using regular expressions to extract the ARGs. However, I am not sure this is the best solution, as the ARGs can vary. Another approach would be to iterate over the nested description values, replace what is currently in the OIE output with a list of them, and then use df.explode() to expand it, so that the right sentence and ID stay linked to each triple after exploding.

Any advice is appreciated.

Solution

Your second idea should do the trick:

import ast
df["OIE Triples"] = df["OIE output"].apply(ast.literal_eval)

df["OIE Triples"] = df["OIE Triples"].apply(lambda val: [a_dict["description"]
                                                         for a_dict in val["verbs"]])
df = df.explode("OIE Triples").drop(columns="OIE output")

If the "OIE output" values are strings rather than real dicts, we first convert them to dicts via ast.literal_eval. If they are already dicts, skip the first two lines and apply the list comprehension directly to df["OIE output"] instead of df["OIE Triples"].
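If you are unsure which of the two cases applies, a small helper can cover both. This is only a sketch: to_dict is a hypothetical name, and the two-row frame below is made-up sample data that mixes a string value with a real dict.

```python
import ast

import pandas as pd


def to_dict(val):
    """Hypothetical helper: parse string values, pass real dicts through."""
    return ast.literal_eval(val) if isinstance(val, str) else val


# Made-up sample data: the first value is a string, the second a real dict.
df = pd.DataFrame({
    "OIE output": [
        "{'verbs': [{'verb': 'went', 'description': "
        "'[ARG0: The girl] [V: went] [ARG1:to the cinema]'}]}",
        {"verbs": [{"verb": "is",
                    "description": "[ARG0: He] [V: is] [ARG1:right]"}]},
    ]
})

# After this step, every value is a dict regardless of how it was stored.
parsed = df["OIE output"].apply(to_dict)
```

Note that ast.literal_eval only evaluates Python literals, so it will raise an error on a string that is not a valid literal.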

Then, for each value in the series, we build a list of the "description" strings found in the dicts under the outermost "verbs" key.

Finally, we explode these description lists, so that each triple gets its own row with the matching sentence and ID, and drop the "OIE output" column, as it is no longer needed.

This gives:

                              sentence      ID                                      OIE Triples
0        'The girl went to the cinema'  'abcd'  [ARG0: The girl] [V: went] [ARG1:to the cinema]
1  'He is right and he is an engineer'  'efgh'                  [ARG0: He] [V: is] [ARG1:right]
1  'He is right and he is an engineer'  'efgh'            [ARG0: He] [V: is] [ARG1:an engineer]
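Note that explode keeps the original row label, which is why the index 1 appears twice above. If you would rather have a fresh 0..n-1 index, the following self-contained sketch shows one way to do it. It assumes the "OIE output" values are already dicts, as produced by the loop in the question, and rebuilds the sample frame by hand.

```python
import pandas as pd

# Sample frame rebuilt from the question; "OIE output" holds real dicts here.
df = pd.DataFrame({
    "sentence": ["The girl went to the cinema",
                 "He is right and he is an engineer"],
    "ID": ["abcd", "efgh"],
    "OIE output": [
        {"verbs": [{"verb": "went",
                    "description": "[ARG0: The girl] [V: went] [ARG1:to the cinema]"}]},
        {"verbs": [
            {"verb": "is", "description": "[ARG0: He] [V: is] [ARG1:right]"},
            {"verb": "is", "description": "[ARG0: He] [V: is] [ARG1:an engineer]"},
        ]},
    ],
})

# One list of description strings per row.
df["OIE Triples"] = df["OIE output"].apply(
    lambda val: [a_dict["description"] for a_dict in val["verbs"]]
)

# One row per triple, raw column dropped, index renumbered from 0.
result = (
    df.explode("OIE Triples")
      .drop(columns="OIE output")
      .reset_index(drop=True)
)
```

In recent pandas versions, explode also accepts ignore_index=True, which should have the same effect as the reset_index call. Also keep in mind that a sentence for which the predictor finds no verbs yields an empty list, which explode turns into a single row with NaN.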