python, scrapy, web-crawler, language-detection

Python website language detection


I am writing a bot that checks thousands of websites to see whether they are in English or not.

I am using Scrapy (Python 2.7 framework) to crawl the first page of each website.

Can someone suggest the best way to detect a website's language?

Any help would be appreciated.


Solution

  • Look into the Natural Language Toolkit:

    NLTK: http://nltk.org/

    What you want is the default English vocabulary that ships with NLTK's words corpus:

    nltk.corpus.words.words()

    Then compare the words in your page text against that list using difflib.

    Reference: http://docs.python.org/library/difflib.html

    Using these tools, you can build a score that measures how far your text is from the English words defined by NLTK, and treat pages that score below a threshold you choose as non-English.
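
A minimal sketch of that kind of score, under some assumptions: the function names `english_ratio` and `looks_english` and the 0.5 threshold are made up for illustration, and plain set membership is used in place of difflib, since fuzzy-matching every token against a vocabulary of that size would likely be slow across thousands of pages. The vocabulary is passed in as a set so the scoring itself does not depend on NLTK.

```python
import re


def english_ratio(text, vocabulary):
    """Return the fraction of alphabetic tokens in `text` found in `vocabulary`.

    `vocabulary` is assumed to be a set of lowercase words.
    """
    # Lowercase the text and keep only runs of ASCII letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        # No Latin-alphabet words at all, so nothing matched.
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return float(hits) / len(tokens)


def looks_english(text, vocabulary, threshold=0.5):
    """Hypothetical helper: True when enough tokens are known English words.

    The 0.5 default is an illustrative assumption, not a tuned value.
    """
    return english_ratio(text, vocabulary) >= threshold


# Possible usage with NLTK (needs nltk installed and the "words" corpus
# downloaded, e.g. via nltk.download("words")):
#
#   from nltk.corpus import words
#   vocabulary = set(w.lower() for w in words.words())
#   looks_english(page_text, vocabulary)
```

The NLTK word list may not contain every inflected form (plurals, verb endings), so the threshold probably needs tuning against a handful of pages whose language you already know.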