Tags: nlp · spacy · training-data

About training data for spaCy NER


I'm new to NLP. I'm trying to create an NER model to extract music artists' names from text, but it hasn't gone well. This is what I've done:

  1. I have a list of 1,500,000 artist names.

  2. I created training data with a string template like "{artist's name} is so sick." All 1,500,000 training examples follow this pattern:

TRAIN_DATA = [
    ("Nirvana is so sick", {"entities": [(0, 7, "ARTIST")]}),
    ("City girls is so sick", {"entities": [(0, 10, "ARTIST")]}),
    ("Taylor swift is so sick", {"entities": [(0, 12, "ARTIST")]}),
]

(Maybe this is the reason it doesn't go well?)

  3. I used the model after training it on 30,000 examples.

  4. But it didn't work at all. Every sentence was extracted in its entirety as ARTIST. Below is an example; 'Chris Thomas King' is the artist's name in this case:

Entities [('Not sure how they handled it during filming, but Tim Blake Nelson did sing his own parts (as did Chris Thomas King).', 'ARTIST')]

Do you have any idea? Thanks in advance.


Solution

  • I would approach this problem differently. The way you are generating training data is biased, in my opinion. Below are the steps I would follow to generate the training data.

    To generate the training data:

    1. Start with the list of artist names that you have and initialize a PhraseMatcher. See here for details on how to do this.
    2. Then, use the matcher to tag the names of artists in your sentences (the 30k sentences that you have).
    3. From the sentences that produced matches, select, say, 2k. Each match gives you the start and end index of the matched span; use these to generate the training data, i.e. TRAIN_DATA.

    Once this is done, you can train your model on that data.
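    A minimal sketch of steps 1–3, assuming spaCy v3 and a small stand-in list in place of the full 1.5M names (which you would load from your own file):

    ```python
    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.blank("en")
    # attr="LOWER" makes matching case-insensitive
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

    # Stand-in for the full 1.5M artist name list
    artist_names = ["Nirvana", "Taylor Swift", "Chris Thomas King"]
    matcher.add("ARTIST", [nlp.make_doc(name) for name in artist_names])

    def annotate(text):
        """Return (text, annotations) in spaCy's training format, or None if no match."""
        doc = nlp.make_doc(text)
        spans = [(doc[s:e].start_char, doc[s:e].end_char, "ARTIST")
                 for _, s, e in matcher(doc)]
        return (text, {"entities": spans}) if spans else None

    print(annotate("I saw Taylor Swift live last year."))
    # ('I saw Taylor Swift live last year.', {'entities': [(6, 18, 'ARTIST')]})
    ```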

    EDIT

    1. You have 1.5 million artist names 👍
    2. Next, you need to think about: "In what context do I want my model to detect these names?"
    3. Once you have an answer to (2), it is time to generate training data. You should train your model on contexts similar to those in which you want it to detect names. For example:
      • If you want your model to detect artist names in news headlines, you should collect 1k to 2k new headlines which have artist names in them.
      • If you want your model to detect names in Reddit comments, you should collect 1k to 2k Reddit comments that contain artist names.
      • How do you know whether these contain artist names, you ask? You have the 1.5M artist names, so you can do a dictionary lookup to verify.
    4. Now that you have your context set and a good amount of training data, it is time to "annotate" the data.
    5. There are many ways to annotate:
      1. Manually, using annotation software: Doccano, Prodigy, and many more
      2. Use a PhraseMatcher to detect the names of artists, just like you did in (3).
    6. Now you have your annotated data; time to train the model.
    7. Voila, you are done and now have a generalized model in hand. 🎉 🥳
    8. Still unhappy with the results? 😐
      • Add more training data, go to (3), repeat
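    Once the data is annotated, a minimal training loop might look like this (spaCy v3 API; the tiny TRAIN_DATA here is just a stand-in for your real annotated corpus with varied contexts):

    ```python
    import random
    import spacy
    from spacy.training import Example

    # Stand-in for the real annotated data from PhraseMatcher or manual annotation
    TRAIN_DATA = [
        ("I saw Nirvana live last year", {"entities": [(6, 13, "ARTIST")]}),
        ("The new Taylor Swift album dropped today", {"entities": [(8, 20, "ARTIST")]}),
        ("Chris Thomas King sang his own parts", {"entities": [(0, 17, "ARTIST")]}),
    ]

    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    ner.add_label("ARTIST")

    optimizer = nlp.initialize()
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)

    # nlp.to_disk("artist_ner")  # save the trained pipeline for later use
    ```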