python-3.x, nlp, spacy, indices, named-entity-recognition

Finding the start and end character indices in spaCy


I am training a custom model in spaCy to extract custom entities. The training data has to contain my entities together with their character index locations, and I would like to know whether there is a faster way to get the index values for the keywords I am looking for in each sentence.

An example of my training data:

TRAIN_DATA = [
    ('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance',
     {'entities': [(25, 37, 'BS'), (40, 60, 'BS'), (62, 79, 'BS')]}),
]

To supply the index locations for specific keywords in my training data, I am currently counting characters by hand.

For example, in the first line, "Behaviour Skills include Communication, ...", I manually worked out the position of the word "Communication" as (25, 37).

I am sure there must be a way to find these indices other than counting them manually. Any ideas on how I can achieve this?


Solution

  • str.find() can help here. However, you have to loop over both the sentences and the keywords:

    keywords = ['Communication', 'Conflict Resolution', 'Work Life Balance']
    texts = ['Behaviour Skills include Communication, Conflict Resolution, Work Life Balance', 
            'Some sentence where lower case conflict resolution is included']
    
    LABEL = 'BS'
    TRAIN_DATA = []
    
    for text in texts:
        entities = []
        t_low = text.lower()
        for keyword in keywords:
            k_low = keyword.lower()
            begin = t_low.find(k_low) # index if substring found and -1 otherwise
            if begin != -1:
                end = begin + len(keyword)
                entities.append((begin, end, LABEL))
        TRAIN_DATA.append((text, {'entities': entities}))
    

    Output:

    [('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance', 
    {'entities': [(25, 38, 'BS'), (40, 59, 'BS'), (61, 78, 'BS')]}), 
    ('Some sentence where lower case conflict resolution is included', 
    {'entities': [(31, 50, 'BS')]})]
    

    I added str.lower() so that matching is case-insensitive, in case you need that.
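
One limitation of str.find() is that it returns only the first occurrence of each keyword, and it will also match inside longer words. If that matters for your data, a variation using re.finditer() might look like the sketch below. The helper name find_entity_spans and the second example sentence are made up for illustration, and the \b word-boundary anchors assume every keyword starts and ends with a letter or digit.

```python
import re


def find_entity_spans(text, keywords, label):
    """Return sorted (start, end, label) tuples for every keyword occurrence in text."""
    spans = []
    for keyword in keywords:
        # re.escape guards against keywords that contain regex metacharacters;
        # \b keeps "Communication" from matching inside "Telecommunication".
        pattern = r"\b" + re.escape(keyword) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


keywords = ['Communication', 'Conflict Resolution', 'Work Life Balance']
texts = [
    'Behaviour Skills include Communication, Conflict Resolution, Work Life Balance',
    'Good communication helps; poor communication hurts',  # hypothetical sentence
]

TRAIN_DATA = [
    (text, {'entities': find_entity_spans(text, keywords, 'BS')})
    for text in texts
]

# Sanity check: every span must slice back to one of the keywords.
lowered = {k.lower() for k in keywords}
for text, annotations in TRAIN_DATA:
    for start, end, _ in annotations['entities']:
        assert text[start:end].lower() in lowered
```

The final loop slices each span back out of the text, which is a cheap way to catch off-by-one offsets before training. spaCy's NER expects entity spans not to overlap, so keywords that overlap each other (say, 'Conflict' and 'Conflict Resolution') would need extra filtering that this sketch does not attempt.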