python json dictionary nlp token

Tokenizing text with features in a specific format


Hello there, I am trying to tokenize text, attach some features to each token, and arrange the result in a JSON-like format. Below are the example text and the structure I would like to produce:

words = ['The study of aviation safety report in the aviation industry usually relies', 
         'The experimental results show that compared with traditional',
         'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']
{"sentence": [
    {
        "indexSentence": 0,
        "tokens": [
            {"indexWord": 1, "word": "The", "len": 3},
            {"indexWord": 2, "word": "study", "len": 5},
            {"indexWord": 3, "word": "of", "len": 2},
            {"indexWord": 4, "word": "aviation", "len": 8},
            ...
        ]
    },
    {
        "indexSentence": 1,
        "tokens": [{
            ...
        }]
    },
    ....
]}

I tried the following code, with no success:

t_d = {len(i):i for i in words}

[{'Lon' : len(t_d[i]),
  'tex' : t_d[i], 
  'Sub' : [{'index' : j,
            'token': [{
                      'word':['word: ' + j for i,j in enumerate(str(t_d[i]).split(' '))] 
                      }],
            'lenTo' : len(str(t_d[i]).split(' '))
           }
          ],
  'Sub1':[{'index' : j}]
 } for j,i in enumerate(t_d)]

Solution

  • The solution below assumes that sentences are tokenized by splitting on whitespace with str.split. It should work equally well with any other tokenizer function that returns a list of tokens.

    from collections import defaultdict
    
    words = ['The study of aviation safety report in the aviation industry usually relies', 
             'The experimental results show that compared with traditional',
             'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']
    
    sentence = defaultdict(list)
    
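    for idx, text in enumerate(words):
        # One entry per sentence; each token records its position, text, and length
        struct = {"indexSentence": idx,
                  "tokens": [{"indexWord": idx_w, "word": w, "len": len(w)}
                             for idx_w, w in enumerate(text.split())]}
        sentence['sentence'].append(struct)

    # Convert back to a plain dict for display or JSON serialization
    dict(sentence)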
    
    >>
    {'sentence': [{'indexSentence': 0,
       'tokens': [{'indexWord': 0, 'word': 'The', 'len': 3},
        {'indexWord': 1, 'word': 'study', 'len': 5},
        {'indexWord': 2, 'word': 'of', 'len': 2},
        {'indexWord': 3, 'word': 'aviation', 'len': 8},
        {'indexWord': 4, 'word': 'safety', 'len': 6},
        {'indexWord': 5, 'word': 'report', 'len': 6},
        {'indexWord': 6, 'word': 'in', 'len': 2},
        {'indexWord': 7, 'word': 'the', 'len': 3},
        {'indexWord': 8, 'word': 'aviation', 'len': 8},
        {'indexWord': 9, 'word': 'industry', 'len': 8},
        {'indexWord': 10, 'word': 'usually', 'len': 7},
        {'indexWord': 11, 'word': 'relies', 'len': 6}]},
      {'indexSentence': 1,
       'tokens': [{'indexWord': 0, 'word': 'The', 'len': 3},
    ...
    }
    

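    defaultdict(list) creates the list under the 'sentence' key the first time it is accessed, so each sentence structure can be appended to it directly. To mimic a JSON structure, turn it back into a plain dict at the end. Note that enumerate counts from 0 by default; to get the 1-based indexWord shown in the question, pass start=1 to the inner enumerate call.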
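
As a rough sketch of how the same idea extends to another tokenizer and to actual JSON output, the version below wraps the logic in a function. The names build_structure and regex_tokenize, the tokenize and start parameters, and the regular expression are all assumptions made for illustration; they do not come from the original answer. It uses a plain dict comprehension rather than defaultdict, which is a stylistic swap and not a requirement.

```python
import json
import re


def build_structure(sentences, tokenize=str.split, start=0):
    """Return {"sentence": [...]} with per-token features for each sentence.

    `tokenize` is any callable mapping a string to a list of tokens
    (hypothetical parameter); `start` is the index of the first word.
    """
    return {
        "sentence": [
            {
                "indexSentence": s_idx,
                "tokens": [
                    {"indexWord": w_idx, "word": tok, "len": len(tok)}
                    for w_idx, tok in enumerate(tokenize(text), start=start)
                ],
            }
            for s_idx, text in enumerate(sentences)
        ]
    }


def regex_tokenize(text):
    # Illustrative alternative tokenizer (an assumption, not from the answer):
    # keeps runs of word characters, allows internal hyphens, drops punctuation.
    return re.findall(r"\w+(?:-\w+)*", text)


words = ['The study of aviation safety report in the aviation industry usually relies',
         'The experimental results show that compared with traditional',
         'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']

# start=1 gives the 1-based indexWord used in the question
result = build_structure(words, tokenize=regex_tokenize, start=1)

# json.dumps turns the nested dict/list structure into a JSON string
as_json = json.dumps(result, indent=2)
print(as_json)
```

Calling build_structure(words) with the defaults should reproduce the whitespace-split, 0-based structure from the answer above, while the regex tokenizer strips the colon from "Cases:" in the third sentence.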