machine-learning, heuristics, data-extraction

Machine Learning - Derive information from a text


I'm new to machine learning and supervised learning.

My task is the following: given the name of a movie file on disk, I'd like to extract some metadata about the file. I have no control over how the file is named, but it contains a title and one or more pieces of additional information, such as a release year, a resolution, actor names and so on.

Currently I have a rule-based heuristic system: I split the name into tokens and try to work out what each word could represent, either alone or together with adjacent ones. To detect people's names, for example, I use a dataset of English names and score a word as a potential first name if I find it in the dataset. If the word next to it scored as a potential surname, I score the two words together as an actor. And so on. It works with decent accuracy, but manually adjusting heuristic scores to "teach" the system is tedious and unpredictable.
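To make the heuristic described above concrete, here is a minimal sketch. The name sets, the list of file extensions, the score values and the example file name are all hypothetical stand-ins chosen for illustration, not the actual system.

```python
import re

# Hypothetical, tiny stand-ins for the real name datasets.
FIRST_NAMES = {"keanu", "carrie", "laurence"}
SURNAMES = {"reeves", "moss", "fishburne"}


def tokenize(filename):
    """Drop a known video extension and split on common separators."""
    stem = re.sub(r"\.(mkv|mp4|avi)$", "", filename, flags=re.IGNORECASE)
    return [t for t in re.split(r"[.\s_\-\[\]()]+", stem) if t]


def score_token(token):
    """Return {label: score} guesses for a single token (scores are arbitrary)."""
    scores = {}
    low = token.lower()
    if re.fullmatch(r"(19|20)\d{2}", token):
        scores["year"] = 0.9
    if re.fullmatch(r"\d{3,4}p", low):
        scores["resolution"] = 0.9
    if low in FIRST_NAMES:
        scores["first_name"] = 0.6
    if low in SURNAMES:
        scores["surname"] = 0.6
    return scores


def find_actors(tokens):
    """Pair a likely first name with an adjacent likely surname."""
    actors = []
    for left, right in zip(tokens, tokens[1:]):
        if "first_name" in score_token(left) and "surname" in score_token(right):
            actors.append(f"{left} {right}")
    return actors


tokens = tokenize("The.Matrix.1999.1080p.Keanu.Reeves.mkv")
labels = {t: score_token(t) for t in tokens}
actors = find_actors(tokens)
print(labels)
print(actors)
```

Every number in `score_token` is a hand-tuned constant, which is exactly the part that becomes tedious to maintain as more rules are added.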

Such a rule-based system is hard to maintain or extend, so, out of curiosity, I have been exploring the field of machine learning. What I would like to know is:

  • Is there some kind of public literature about these kinds of problems?
  • Is ML a good way to approach the problem, given the limited data set available?
  • How would I go about debugging or understanding the results of such a model? I already have trouble with the "simplistic" heuristic engine I have developed.

Thanks, any advice would be appreciated.


Solution

  • Look into NLP (natural language processing). Among other things, NLP covers text-processing tasks such as entity recognition and tagging, which is essentially what labeling the parts of a file name as title, year or actor amounts to.

    Here is an example using the spaCy library: https://spacy.io/usage/linguistic-features

    I did something similar some time ago; you can see it here: https://github.com/Erlemar/Erlemar.github.io/blob/master/Notebooks/Fate_Zero_explore.ipynb
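As a rough illustration of what entity tagging with spaCy can look like on a file name, here is a minimal sketch. It assumes spaCy v3 is installed and uses a blank English pipeline with hand-written `entity_ruler` patterns, so it is still rule-based; the labels, patterns and file name are hypothetical. A trained statistical NER component would take the place of the hand-written patterns once labeled examples are available.

```python
import re

import spacy
from spacy.tokens import Doc

# Blank English pipeline: no pretrained model download is needed.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical patterns; a real system would need many more.
ruler.add_patterns([
    {"label": "YEAR", "pattern": [{"TEXT": {"REGEX": r"^(19|20)\d{2}$"}}]},
    {"label": "RESOLUTION", "pattern": [{"TEXT": {"REGEX": r"^\d{3,4}p$"}}]},
    {"label": "PERSON", "pattern": [{"LOWER": "keanu"}, {"LOWER": "reeves"}]},
])

# Split the file name ourselves so the tokens are predictable,
# then build a Doc from them instead of using spaCy's tokenizer.
words = [w for w in re.split(r"[.\s_-]+", "The.Matrix.1999.1080p.Keanu.Reeves") if w]
doc = ruler(Doc(nlp.vocab, words=words))

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

The useful part is the output format: each span carries a label, which is the same shape of data a trained tagger would produce, and it can be inspected entity by entity when debugging.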