python nlp classification nltk document-classification

Testing my classifier on a review


Okay, so I have been able to train my movie review classifier using the Naive Bayes algorithm. The task is to:

Test your classifier against a negative review of The Walking Dead: http://metro.co.uk/2017/02/27/the-walking-dead-season-7-episode-11-hostiles-and-calamities-wasnt-as-exciting-as-it-sounds-6473911/#mv-a

My book gave an example of classifying documents that used classifier.classify(df). I understand that df was the document's features, and that the document had to be tokenized first, etc.

My question: is there some way to test my classifier against the review using just the URL? Or do I have to select all the text of the review, store it as a string or document, then tokenize it, etc.?


Solution

  • Your program can read the contents of a URL like this:

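    import urllib.request
    # Python 3: urlopen lives in urllib.request and can be used in a with-block
    with urllib.request.urlopen("http://example.com/review.html") as rec:
        data = rec.read()  # raw bytes of the page, markup included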
    

    However, the URL you suggest points to an HTML document, so you'll need to "scrape" the contents (i.e., extract the body of the review and convert it to plain text by stripping markup such as boldface) before you go any further. For this you can use BeautifulSoup or something similar. (NLTK used to have a scraping function but dropped it in favor of BeautifulSoup.) Unless you've already learned how to do this, it would indeed be simpler to grab a few test documents by copy-pasting them from your browser into a text-only editor like Notepad, which removes all markup. Either way, the resulting text still has to be tokenized and converted to the same kind of features you used for training before you pass it to classifier.classify().
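
To make the scrape-then-classify idea concrete, here is a minimal sketch rather than a definitive implementation. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra installs. The names ParagraphText, html_to_text, document_features, and the sample data are all hypothetical. It assumes the review body sits inside <p> elements, and that your classifier was trained on "contains(word)" features; adapt the feature function to whatever your own training code used.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <p> elements currently open
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1
            self.chunks.append(" ")  # keep adjacent paragraphs apart

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the paragraph text of an HTML string with whitespace normalized."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def document_features(text, word_features):
    """Hypothetical extractor: one 'contains(word)' flag per training word."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {"contains({})".format(w): (w in words) for w in word_features}


# Stand-in for a downloaded page; a real page would come from urlopen(...).read()
sample_html = (
    "<html><body><h1>Review</h1>"
    "<p>It was <b>not</b> exciting.</p>"
    "<script>var x = 1;</script></body></html>"
)
sample_text = html_to_text(sample_html)
sample_features = document_features(sample_text, ["exciting", "boring"])
```

With a trained classifier you would then call classifier.classify(document_features(text, word_features)), where word_features is the same word list you used during training. Real pages usually wrap the article in a lot of navigation and advertising markup, so expect to narrow the extraction to the element that actually holds the review.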