This is a case of me wanting to search for something online but not knowing what it's called.
I have a collection of job descriptions in text files, some only a sentence or two long, most a paragraph or two. I want to write a script that, given a set of rules, will notify me when it finds a job description I would want.
For example, let's say I am looking for a job in PHP programming, but not a full-time position and not a designing position. So my "rule book" could be:
want: PHP
want: web programming
want: telecommuting
do not want: designing
do not want: full-time position
What is a method I could use to sort these files into a "pass" (descriptions that match what I'm looking for) and a "fail" (descriptions that are not relevant)? Some ideas I was considering:
[...] "designing" so it is not relevant" when it really was!
"designing" and "design" should be treated the same way, as well as misspellings of words, such as "programing".
Does anyone know what this "filtering process" is called, and/or have any advice or methods on how I can accomplish this?
You basically have a text classification (or document classification) problem. This is a specific case of binary classification, which is itself a specific case of supervised learning. It's a well-studied problem, and there are many tools for it. Basically, you give a set of good documents and bad documents to a learning (training) process, which finds words that correlate strongly with positive and negative documents and outputs a function capable of classifying unseen documents as positive or not. Naive Bayes is the simplest learning algorithm for this kind of task, and it will do a decent job. There are fancier algorithms, like logistic regression and support vector machines, which will probably do somewhat better, but they are more complicated.
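To make the training step concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, using only the Python standard library. The class name `NaiveBayes`, the `tokenize` helper, the "pass"/"fail" labels, and the four training descriptions are all made up for illustration; in practice you would train on many hand-labeled descriptions, and probably use an existing library rather than rolling your own.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every word seen during training

    def train(self, examples):
        """examples is an iterable of (text, label) pairs."""
        for text, label in examples:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counts[word] += 1
                self.vocab.add(word)

    def score(self, text, label):
        """Log of P(label) times the product of P(word | label)."""
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        total_docs = sum(self.doc_counts.values())
        log_prob = math.log(self.doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word not in self.vocab:
                continue  # words never seen in training carry no evidence
            log_prob += math.log(
                (counts[word] + 1) / (total_words + len(self.vocab))
            )
        return log_prob

    def classify(self, text):
        """Return the label with the highest score for this text."""
        return max(self.doc_counts, key=lambda label: self.score(text, label))


# Hypothetical hand-labeled training data (illustration only).
training = [
    ("PHP web programming contract, telecommuting possible", "pass"),
    ("Part-time PHP developer for web programming, remote", "pass"),
    ("Full-time graphic designing position in our office", "fail"),
    ("Full-time designer needed for print design", "fail"),
]

classifier = NaiveBayes()
classifier.train(training)

print(classifier.classify("Remote PHP programming gig"))  # pass
print(classifier.classify("Full-time designing job"))     # fail
```

Note that the "rule book" never appears explicitly: the classifier infers which words matter from the labeled examples, so the quality of the result depends mostly on how many descriptions you label.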
To determine which word variants are actually equivalent to each other, you want to do some kind of stemming. The Porter stemmer is a common choice here.
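As a rough illustration of what stemming buys you, here is a deliberately crude suffix stripper. It is not the Porter algorithm, which has many more rules and conditions; the function name `crude_stem` and the suffix list are assumptions made up for this sketch. For real use, an existing implementation such as NLTK's `PorterStemmer` is a better choice.

```python
# Suffixes to strip, checked in this order (a made-up list for illustration).
SUFFIXES = ("ing", "ers", "er", "ed", "s")


def crude_stem(word):
    """Reduce a word to a rough stem by stripping one common suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least four letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "programm" -> "program".
    # Like Porter, leave a final "ll", "ss", or "zz" alone.
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeioulsz":
        word = word[:-1]
    return word


for w in ("designing", "design", "programming", "programing"):
    print(w, "->", crude_stem(w))
```

With this, "designing" and "design" map to the same stem, and so do "programming" and the misspelled "programing". That last match is a lucky side effect of suffix stripping; misspellings in general need a separate step, such as fuzzy matching against a word list.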