python-3.x nlp spacy indices named-entity-recognition

Finding the start and end character indices in spaCy

I am training a custom spaCy model to extract custom entities. The training data must list each entity together with its character index locations, and I would like to know whether there is a faster way to assign those index values for the keywords I am looking for in each sentence of my training data.

An example of my training data:

TRAIN_DATA = [
    ('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance',
     {'entities': [(25, 37, 'BS'), (40, 60, 'BS'), (62, 79, 'BS')]}),
]

At present, I count the characters manually to work out the index location of each keyword in my training data.

For example, in the first sentence, "Behaviour Skills include Communication ...", I manually calculate the index location of the word "Communication", which is 25,37.

I am sure there must be a way to identify these indices other than counting them manually. Any ideas how I can achieve this?

Solution

Using str.find() can help here. However, you have to loop through both the sentences and the keywords:

keywords = ['Communication', 'Conflict Resolution', 'Work Life Balance']
texts = ['Behaviour Skills include Communication, Conflict Resolution, Work Life Balance', 
        'Some sentence where lower case conflict resolution is included']

LABEL = 'BS'
TRAIN_DATA = []

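for text in texts:
    entities = []
    # Lower-case both sides so that the search is case-insensitive
    t_low = text.lower()
    for keyword in keywords:
        k_low = keyword.lower()
        # str.find() reports the first occurrence only
        begin = t_low.find(k_low) # index if substring found and -1 otherwise
        if begin != -1:
            # The end offset is exclusive: one past the last character
            end = begin + len(keyword)
            entities.append((begin, end, LABEL))
    TRAIN_DATA.append((text, {'entities': entities}))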

Output:

[('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance', 
{'entities': [(25, 38, 'BS'), (40, 59, 'BS'), (61, 78, 'BS')]}), 
('Some sentence where lower case conflict resolution is included', 
{'entities': [(31, 50, 'BS')]})]
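
A quick way to sanity-check offsets like these is to slice each text with them and see whether the keyword comes back. The end offset is exclusive, as in Python slicing, which is why "Communication" ends at 38 rather than 37. The sketch below simply copies the output above into a literal; the variable name spans is only for illustration.

```python
# Offsets copied from the output above
TRAIN_DATA = [
    ('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance',
     {'entities': [(25, 38, 'BS'), (40, 59, 'BS'), (61, 78, 'BS')]}),
    ('Some sentence where lower case conflict resolution is included',
     {'entities': [(31, 50, 'BS')]}),
]

# Slice every text with its (start, end) pairs to recover the annotated spans
spans = [
    [text[start:end] for start, end, _label in annotations['entities']]
    for text, annotations in TRAIN_DATA
]

for text_spans in spans:
    print(text_spans)
```

If a slice comes back truncated or shifted (for example "Communicatio"), the offsets were miscounted.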

I added str.lower() so that the matching is case-insensitive, in case you need that.
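
Note that str.find() only returns the first occurrence, so a keyword that appears twice in one sentence would be annotated once. If that matters for your data, one possible variation is to use re.finditer() from the standard library instead. This is a minimal sketch rather than a definitive implementation: the helper name find_entities and the sample sentence are made up for illustration.

```python
import re

def find_entities(text, keywords, label):
    """Return (start, end, label) for every case-insensitive keyword match."""
    entities = []
    for keyword in keywords:
        # re.escape() keeps characters such as '+' or '.' in a keyword literal
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        for match in pattern.finditer(text):
            # match.end() is exclusive, like the offsets used above
            entities.append((match.start(), match.end(), label))
    # Sort by start offset so the spans appear in reading order
    return sorted(entities)

keywords = ['Communication', 'Conflict Resolution']
text = 'Communication matters. Good communication avoids conflict resolution later.'

entities = find_entities(text, keywords, 'BS')
print(entities)
```

Neither version checks for overlapping matches (for instance when one keyword is contained in another), so you may need to filter those out before training.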