python json pandas label

Labeling sentences from different nested dictionaries


I wrote a function that extracts sentences from a specific key in a list of nested dictionaries. Now I would like the function to also attach a label to each sentence.

Each time the value HEADER appears, it marks the beginning of a NEW story. I would like sentences that belong to the same story to share a label, and sentences from different stories to get different labels.

The data looks like the following:

sentences = [{'c': 'HEADER', 'a1': {'a': 'Opus dei, la vie en rose.', 'x': 'l'}},
      {'d': 'm', 'a1': {'a': 'Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
      {'c': 'j', 'a1': {'a': 'Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
      {'c':'h', 'b': 'p'},
      {'a1': {'a': 'Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}},
      {'c': 'HEADER', 'a1': {'a': 'NEW Opus dei, la vie en rose.', 'x': 'l'}},
      {'d': 'm', 'a1': {'a': 'NEW Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
      {'c': 'j', 'a1': {'a': 'NEW Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
      {'c':'h', 'b': 'p'},
      {'a1': {'a': 'NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}}]

The function

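import pandas as pd

def prhases_and_labels(data):
    # Keep only the records that contain an 'a1' entry
    a1 = [d for d in data if 'a1' in d]
    # Collect the sentence stored under the nested 'a' key
    text = [d['a1']['a'] for d in a1]

    df = pd.DataFrame({'text': text})
    return df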


The result that I would like to obtain has the labels in a new column.


Solution

  • You can iterate over the records, keep a running label, and increment it every time a record's c value is HEADER.

    sentences = [{'c': 'HEADER', 'a1': {'a': 'Opus dei, la vie en rose.', 'x': 'l'}},
          {'d': 'm', 'a1': {'a': 'Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
          {'c': 'j', 'a1': {'a': 'Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
          {'c':'h', 'b': 'p'},
          {'a1': {'a': 'Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}},
          {'c': 'HEADER', 'a1': {'a': 'NEW Opus dei, la vie en rose.', 'x': 'l'}},
          {'d': 'm', 'a1': {'a': 'NEW Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
          {'c': 'j', 'a1': {'a': 'NEW Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
          {'c':'h', 'b': 'p'},
          {'a1': {'a': 'NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}}]
    
    
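    import pandas as pd

    def prhases_and_labels(data):
        label = 0  # story counter; becomes 1 at the first HEADER
        res = {'text': [], 'label': []}
        for record in data:
            # Skip records that carry no sentence
            if 'a1' in record:
                line = record['a1']['a']
                # A HEADER record starts a new story
                if record.get('c') == 'HEADER':
                    label += 1

                res['text'].append(line)
                res['label'].append(label)

        return pd.DataFrame(res)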
    

    Output:

    >>> prhases_and_labels(sentences)
    
                                                                     text  label
    0                                           Opus dei, la vie en rose.      1
    1                                   Ipsum lorem, Suspendisse posuere.      1
    2             Nulla elementum, augue fringilla tincidunt ullamcorper.      1
    3      Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.      1
    4                                       NEW Opus dei, la vie en rose.      2
    5                               NEW Ipsum lorem, Suspendisse posuere.      2
    6         NEW Nulla elementum, augue fringilla tincidunt ullamcorper.      2
    7  NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.      2
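
As a possible alternative, the same labels can be computed without an explicit counter by loading the records into a DataFrame and taking a cumulative sum over the rows where c equals HEADER. This is only a sketch, not part of the original answer: the function name phrases_and_labels_vectorized and the small sample list are made up for illustration, and it assumes that at least one record has a 'c' key and at least one has an 'a1' key (otherwise those columns would not exist). One difference from the loop above is that a HEADER record without 'a1' still starts a new story here.

```python
import pandas as pd


def phrases_and_labels_vectorized(data):
    # Hypothetical helper name, not from the original answer.
    # Each record becomes a row; missing keys become NaN.
    df = pd.DataFrame(data)

    # A story starts at every row whose 'c' is 'HEADER'; the running
    # count of such rows is the story label. NaN compares as False.
    df['label'] = df['c'].eq('HEADER').cumsum()

    # Keep only rows that carry an 'a1' dict, then pull out its 'a' text.
    df = df[df['a1'].notna()]
    return pd.DataFrame({
        'text': df['a1'].map(lambda d: d['a']).tolist(),
        'label': df['label'].tolist(),
    })


# Made-up sample data in the same shape as the question's records.
sample = [
    {'c': 'HEADER', 'a1': {'a': 'first story, line 1'}},
    {'d': 'm', 'a1': {'a': 'first story, line 2'}},
    {'c': 'h', 'b': 'p'},
    {'c': 'HEADER', 'a1': {'a': 'second story, line 1'}},
]
out = phrases_and_labels_vectorized(sample)
print(out)
```

For data of this size the explicit loop is just as good and arguably easier to read; the cumulative-sum form mainly helps when the records are already in a DataFrame.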