I am writing a bot that checks thousands of websites to see whether they are in English or not.
I am using Scrapy (Python 2.7 framework) to crawl the first page of each website.
Can someone suggest the best way to detect a website's language?
Any help would be appreciated.
Look into the Natural Language Toolkit (NLTK): http://nltk.org/
What you want is the words corpus, which gives you the default English vocabulary that ships with NLTK:
nltk.corpus.words.words()
Then compare your text against that word list using difflib.
Reference: http://docs.python.org/library/difflib.html
Using these tools, you can build a score for how far your text is from the English words defined by NLTK, and treat a page as English when that score passes a threshold you choose.
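A minimal sketch of that idea, under a few assumptions: the page text has already been extracted from the HTML, the score is simply the fraction of words found in the vocabulary, and difflib is used only as an optional fallback for near-misses. The names english_ratio, fuzzy_cutoff and load_nltk_vocabulary are made up for illustration and are not part of NLTK or Scrapy. Loading the real word list assumes NLTK is installed and the corpus has been fetched once with nltk.download('words').

```python
import difflib
import re


def english_ratio(text, vocabulary, fuzzy_cutoff=None):
    """Return the fraction of alphabetic tokens in `text` found in `vocabulary`.

    `vocabulary` is any collection of lowercase words (a set is fastest).
    If `fuzzy_cutoff` is given (0.0 to 1.0), tokens that are not exact
    matches are also counted when difflib finds a close enough word.
    """
    # Crude tokenizer: runs of ASCII letters only, so accented words get split.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = 0
    for token in tokens:
        if token in vocabulary:
            hits += 1
        elif fuzzy_cutoff is not None and difflib.get_close_matches(
                token, vocabulary, n=1, cutoff=fuzzy_cutoff):
            hits += 1
    return hits / float(len(tokens))


def load_nltk_vocabulary():
    """Load NLTK's English word list as a lowercase set (needs the 'words' corpus)."""
    from nltk.corpus import words
    return set(w.lower() for w in words.words())


if __name__ == "__main__":
    # Tiny stand-in vocabulary so the sketch runs without NLTK installed.
    vocab = set(["the", "cat", "sat", "on", "mat"])
    print(english_ratio("The cat sat on the mat.", vocab))
    print(english_ratio("Die Katze sitzt auf der Matte", vocab))
```

With the full NLTK list you would call english_ratio(page_text, load_nltk_vocabulary()) and pick a cutoff by trying it on pages whose language you already know. The difflib fallback scans the whole vocabulary for every unmatched token, so it is probably too slow for thousands of pages against the full list; exact set lookup alone may be enough.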