python machine-learning named-entity-recognition

Create BIO labels for a sentence from a JSON file to train an NER model


I have a JSON file that will be used as training data for an NER model. Each object contains a sentence and the entities found in that sentence. I want to write a function that generates a BIO-labeled string for each sentence based on its entities.

For example, take the following object from the JSON file:

{
      "request": "I want to fly to New York on the 13.3",
      "entities": [
        {"start": 16, "end": 23, "text": "New York", "category": "DESTINATION"},
        {"start": 32, "end": 35, "text": "13.3", "category": "DATE"}
      ]
}

For the sentence "I want to fly to New York on the 13.3", the corresponding BIO label string is "O O O O O B-DESTINATION I-DESTINATION O O B-DATE", where B-category marks the first token of an entity of that category, I-category marks a token inside it, and O marks a token outside any entity.

I'm looking for Python code that iterates over each object in the JSON file and generates the BIO labels for it.

Feel free to change the JSON format if necessary.
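
To make the target format concrete, here is a minimal sketch that pairs each whitespace-separated token of the example sentence with its expected tag. The variable names are only illustrative, and splitting on whitespace is an assumption about how the sentence should be tokenized.

```python
sentence = "I want to fly to New York on the 13.3"
labels = "O O O O O B-DESTINATION I-DESTINATION O O B-DATE"

# One tag per whitespace-separated token (assumed tokenization)
pairs = list(zip(sentence.split(), labels.split()))

for token, label in pairs:
    print(f"{token}\t{label}")
```

"New" is the first token of the DESTINATION entity and gets the B- tag, "York" continues it and gets the I- tag, and every token outside an entity gets O.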


Solution

  • This is just a quick implementation of the task above; many optimizations are possible and can be explored later. At first glance, here is the function:

    def BIO_converter(r, entities):
        to_replace = {}  # maps each entity token to its BIO tag
        for i in entities:
            # The sample offsets are shifted: the entity text is r[start+1:end+2]
            sub = r[i['start']+1:i['end']+2].split(' ')
            # The first token gets B-, any remaining tokens get I-
            vals = [f"B-{i['category']}"] + [f"I-{i['category']}"] * (len(sub)-1)
    
            to_replace.update(zip(sub, vals))  # also works before Python 3.9
    
        # Note: tokens are matched by text, not by position in the sentence
        r = r.split(' ')
        r = [to_replace.get(i, 'O') for i in r]
        return ' '.join(r)
    
    js = {
            "request": "I want to fly to New York on the 13.3",
            "entities": [
              {"start": 16, "end": 23, "text": "New York", "category": "DESTINATION"},
              {"start": 32, "end": 35, "text": "13.3", "category": "DATE"}
            ]
          }
    print(BIO_converter(js['request'], js['entities']))
    

    Should output:

    O O O O O B-DESTINATION I-DESTINATION O O B-DATE
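
The function above matches tokens by their text, so a word that occurs both inside and outside an entity (for example a second "York" later in the sentence) would be tagged as an entity in both places. Below is a minimal sketch of a position-based alternative that compares character offsets instead. The names `bio_tags`, `start_shift` and `end_shift` are hypothetical. The default shifts assume the convention seen in the sample data, where the entity text is `request[start+1:end+2]`. The sketch also assumes the JSON file holds a list of such objects and that tokens are separated by single spaces.

```python
import json


def bio_tags(text, entities, start_shift=1, end_shift=2):
    """Tag space-separated tokens by character position.

    Assumption: the entity text is text[start + start_shift : end + end_shift],
    as in the sample data. Pass 0 and 0 for standard start-inclusive,
    end-exclusive offsets.
    """
    # Normalize every entity to a (char_start, char_end, category) span
    spans = [(e["start"] + start_shift, e["end"] + end_shift, e["category"])
             for e in entities]

    tags = []
    pos = 0  # character offset of the current token
    for token in text.split(" "):
        tok_start, tok_end = pos, pos + len(token)
        pos = tok_end + 1  # step over the single space separator
        tag = "O"
        for span_start, span_end, category in spans:
            # The token lies entirely inside this entity span
            if token and span_start <= tok_start and tok_end <= span_end:
                prefix = "B-" if tok_start == span_start else "I-"
                tag = prefix + category
                break
        tags.append(tag)
    return " ".join(tags)


# Stand-in for the file contents; with a real file, json.load(open_file)
# would take the place of json.loads here.
raw = """
[
  {
    "request": "I want to fly to New York on the 13.3",
    "entities": [
      {"start": 16, "end": 23, "text": "New York", "category": "DESTINATION"},
      {"start": 32, "end": 35, "text": "13.3", "category": "DATE"}
    ]
  }
]
"""

results = [bio_tags(item["request"], item["entities"]) for item in json.loads(raw)]
for line in results:
    print(line)  # O O O O O B-DESTINATION I-DESTINATION O O B-DATE
```

Because each token is compared against entity spans by position, repeated words are only tagged where they actually fall inside an entity. If the real data uses standard offsets, calling `bio_tags(text, entities, 0, 0)` should be enough.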