I am training a custom spaCy model to extract custom entities. The training data has to list each entity together with its character offsets in the text, and I would like to know whether there is a faster way to get the offsets of the keywords I am looking for in each training sentence.
An example of my training data:
TRAIN_DATA = [
    ('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance',
     {'entities': [(25, 37, 'BS'), (40, 60, 'BS'), (62, 79, 'BS')]})
]
To supply the offsets of specific keywords in my training data, I am currently counting characters by hand.
For example, in the first sentence ("Behaviour Skills include Communication, ...") I manually work out the offsets of the word "Communication", which I get as 25,37.
I am sure there must be a way to find these offsets programmatically instead of counting them manually. Any ideas on how I can achieve this?
Using str.find() can help here. However, you have to loop through both the sentences and the keywords:
keywords = ['Communication', 'Conflict Resolution', 'Work Life Balance']
texts = ['Behaviour Skills include Communication, Conflict Resolution, Work Life Balance',
         'Some sentence where lower case conflict resolution is included']
LABEL = 'BS'
TRAIN_DATA = []
for text in texts:
    entities = []
    t_low = text.lower()  # lowercase copy so the search is case-insensitive
    for keyword in keywords:
        k_low = keyword.lower()
        # str.find() returns the index of the first match, or -1 if there is none
        begin = t_low.find(k_low)
        if begin != -1:
            end = begin + len(keyword)  # end offset is exclusive
            entities.append((begin, end, LABEL))
    TRAIN_DATA.append((text, {'entities': entities}))
Output:
[('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance',
{'entities': [(25, 38, 'BS'), (40, 59, 'BS'), (61, 78, 'BS')]}),
('Some sentence where lower case conflict resolution is included',
{'entities': [(31, 50, 'BS')]})]
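A quick way to sanity-check offsets like these, whether computed or counted by hand, is to slice the text with each span and look at what comes back. This is only a minimal sketch; the helper name `check_spans` is my own and not part of spaCy:

```python
def check_spans(text, entities):
    """Return the substring that each (start, end, label) span actually covers."""
    return [(text[start:end], label) for start, end, label in entities]


text = 'Behaviour Skills include Communication, Conflict Resolution, Work Life Balance'

# Offsets as computed with str.find(); the end offset is exclusive
computed = [(25, 38, 'BS'), (40, 59, 'BS'), (61, 78, 'BS')]
print(check_spans(text, computed))

# A hand-counted span that stops one character early
print(check_spans(text, [(25, 37, 'BS')]))
```

The first call gives back the three full keywords, while the hand-counted span (25, 37) yields 'Communicatio', which shows how easily a manual count ends up off by one.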
I added str.lower() so that the matching is case-insensitive, just in case you might need it.
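Note that str.find() only reports the first occurrence of each keyword, and it also matches inside longer words. If that matters for your data, one possible variation is to use re.finditer() with word boundaries. This is a sketch under my own assumptions: the function name `find_entities` is made up for illustration, and the `\b` boundaries assume the keywords start and end with letters or digits.

```python
import re


def find_entities(text, keywords, label):
    """Return sorted (start, end, label) spans for every whole-word keyword match."""
    entities = []
    for keyword in keywords:
        # re.escape() protects any regex metacharacters in the keyword;
        # \b stops 'Communication' from matching inside 'telecommunication'
        pattern = r'\b' + re.escape(keyword) + r'\b'
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            entities.append((match.start(), match.end(), label))
    return sorted(entities)


keywords = ['Communication', 'Conflict Resolution', 'Work Life Balance']
text = 'Communication matters, and telecommunication is not communication.'
print(find_entities(text, keywords, 'BS'))
# [(0, 13, 'BS'), (52, 65, 'BS')]
```

Both standalone occurrences are found regardless of case, and the one embedded in "telecommunication" is skipped. If some keywords can overlap each other in the text, you would probably still need to decide which span to keep before training, since as far as I know spaCy's NER expects non-overlapping entities.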