Search code examples
pythonnltkmwe

Python NLTK Train Data Set


I'm trying to train my NLTK model to recognize movie names (ex. "game of thrones") I have a text file where each line is a movie name.

How do I train my NLTK model to recognize these movie names if it sees it in a sentence during tokenization?

I searched around but found no resources. Any help is appreciated


Solution

  • It sounds like you are talking about training a named entity recognition (NER) model for movie names. To train an NER model in the traditional way, you'll need more than a list of movie names - you'll need a tagged corpus that might look something like the following (based on the 'dataset format' here):

    I PRP O
    like VBP O
    the DT O
    movie NN O
    Game NN B-MOV
    of IN I-MOV
    Thrones NN I-MOV
    . Punc O
    

    but going on for a very long time (say, minimum 10,000 words to give enough examples of movie names in running text). Each word is following by the part-of-speech (POS) tag, and then the NER tag. B-MOV indicates that 'Game' is the start of a movie name, and I-MOV indicates that 'of' and 'Thrones' are 'inside' a movie name. (By the way, isn't Game of Thrones a TV series as opposed to a movie? I'm just reusing your example anyway...)

    How would you create this dataset? Annotating by hand. It is a laborious process, but this is how state-of-the-art NER systems are trained, because whether or not something should be detected as a movie name depending on the context in which it appears. For example, there is a Disney movie called 'Aliens', but the same word 'Aliens' is a movie title in the second sentence below but not the first.

    1. Aliens are hypothetical beings from other planets.
    2. I went to see Aliens last week.

    Tools like docanno exist to aid the annotation process. The dataset to be annotated should be selected depending on the final use case. For example, if you want to be able to find movie names in news articles, use a corpus of news articles. If you want to be able to find movie names in emails, use emails. If you want to be able to find movie names in any type of text, use a corpus with a wide range of different types of texts.

    This is a good place to get started if you decide to stick with training and NER model using NLTK, although some of the answers here suggest other libraries you might want to use, such as spaCy.

    Alternatively, if the whole tagging process sounds like too much work and you just want to use your list of movie names, look at fuzzy string matching. In this case, I don't think NLTK is the library to use as I'm not aware of any fuzzy string matching features in NLTK. You might instead use fuzzysearch as per the answer here.