Tags: download, lda, gensim, common-crawl

How to download a subset of the Common Crawl dataset hosted on Amazon (only the text, i.e. the WET files, is needed)


For research purposes, I want a large set (~100K) of web pages, though I am only interested in their text. I plan to use them to train a gensim LDA topic model. Common Crawl seems like a good place to start, but I am not sure how to go about it. Could someone explain how to download 100K text files, or how to access them directly if that is easier than downloading?


Solution

  • It is possible to download only part of the dataset (each crawl is published separately, so you can pick just the month you want), and to download only the extracted text, which is stored in WET files. For example, the August 2014 crawl data is available from: http://blog.commoncrawl.org/2014/09/august-2014-crawl-data-available/ and an explanation of the file format can be found here: http://blog.commoncrawl.org/2014/04/navigating-the-warc-file-format/
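
As a rough illustration of that approach, here is a minimal Python sketch that lists the WET files of one crawl and streams page text out of them using only the standard library. Several things here are assumptions and not taken from the posts linked above: the HTTP host `https://data.commoncrawl.org/`, the crawl ID `CC-MAIN-2014-35` for August 2014, the `wet.paths.gz` listing and its path layout, and the helper names `list_wet_paths`, `iter_wet_records`, and `download_texts`, which are made up for this example. Please check the Common Crawl site for the current locations before relying on it.

```python
import gzip
import io
import urllib.request

# Assumptions (not from the original answer): public HTTP host and the
# crawl ID believed to correspond to the August 2014 crawl.
BASE_URL = "https://data.commoncrawl.org/"
CRAWL_ID = "CC-MAIN-2014-35"


def list_wet_paths(crawl_id=CRAWL_ID, limit=1):
    """Return the first `limit` WET file paths from the crawl's path listing."""
    url = f"{BASE_URL}crawl-data/{crawl_id}/wet.paths.gz"
    with urllib.request.urlopen(url) as resp:
        lines = gzip.decompress(resp.read()).decode("utf-8").splitlines()
    return lines[:limit]


def iter_wet_records(stream):
    """Yield (url, text) for each 'conversion' record in a decompressed WET stream.

    `stream` is any binary file-like object positioned at the start of the data.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        # Read "Name: value" header lines until the blank line ending the header.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            key, _, value = line.decode("utf-8", "replace").partition(":")
            headers[key.strip().lower()] = value.strip()
        # The record body is exactly Content-Length bytes long.
        body = stream.read(int(headers.get("content-length", 0)))
        # 'conversion' records hold extracted page text; others are metadata.
        if headers.get("warc-type") == "conversion":
            yield headers.get("warc-target-uri"), body.decode("utf-8", "replace")


def download_texts(max_docs=100_000, max_files=1):
    """Collect up to `max_docs` page texts from the first `max_files` WET files."""
    texts = []
    for path in list_wet_paths(limit=max_files):
        with urllib.request.urlopen(BASE_URL + path) as resp:
            # GzipFile decompresses the HTTP response while it is being read.
            with gzip.GzipFile(fileobj=resp) as stream:
                for _url, text in iter_wet_records(stream):
                    texts.append(text)
                    if len(texts) >= max_docs:
                        return texts
    return texts
```

Calling `download_texts(max_docs=100_000, max_files=5)` would then return a list of plain-text documents that can be tokenized and passed to gensim. Raise `max_files` if the files read so far yield fewer pages than you need. Because the records are streamed, nothing has to be saved to disk unless you want a local copy.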