search-engine, common-crawl

CommonCrawl: How to find a specific web page?


I am using CommonCrawl to restore pages I should have archived but have not.

In my understanding, the Common Crawl Index offers access to all URLs stored by Common Crawl. Thus, it should tell me whether a URL has been archived.
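As a quick check, the index can also be queried for a single URL or domain over HTTP via the CDX server at index.commoncrawl.org. A minimal sketch (the host and parameters here are assumed from that server's public interface, and the domain is only an example):

import json
import requests

# Query one crawl's index; matchType=domain is assumed to cover subdomains
# the same way "*.thesun.co.uk" does for cdx-index-client.
endpoint = "https://index.commoncrawl.org/CC-MAIN-2016-18-index"
params = {
    "url": "thesun.co.uk",
    "matchType": "domain",
    "output": "json",
    "limit": "10",
}
resp = requests.get(endpoint, params=params)
if resp.status_code == 200:
    for line in resp.text.splitlines():
        record = json.loads(line)
        print(record["timestamp"], record["url"])
else:
    # The CDX server answers with an error status when there are no captures.
    print("no captures found (HTTP %d)" % resp.status_code)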

A simple script downloads all indices from the available crawls:

./cdx-index-client.py -p 4 -c CC-MAIN-2016-18 *.thesun.co.uk --fl url -d CC-MAIN-2016-18
./cdx-index-client.py -p 4 -c CC-MAIN-2016-07 *.thesun.co.uk --fl url -d CC-MAIN-2016-07
... and so on

Afterwards I have 112 MB of data and simply grep:

grep "50569" * -r
grep "Locals-tell-of-terror-shock" * -r

The pages are not there. Am I missing something? The pages were published in 2006 and removed in June 2016, so I assume that CommonCrawl should have archived them?

Update: Thanks to Sebastian, two links are left... The two URLs are:

They even propose a "URL Search Tool", which responds with a 502 - Bad Gateway...


Solution

  • You can use AWS Athena to query the Common Crawl index with SQL to find the URL, and then use the offset, length, and filename to read the content in your own code (see the sketch below). See details here: http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

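    For example, a query against the columnar index plus a ranged read of the matching WARC record might look roughly like the sketch below. The table and column names (ccindex.ccindex, crawl, subset, url, warc_filename, warc_record_offset, warc_record_length) are assumed to follow the schema described in the linked post, and the filter values are illustrative only.

    import gzip
    import io

    import boto3

    # SQL to run in Athena (in the console or via boto3's Athena client); it
    # returns the WARC file name, byte offset and record length of each capture.
    ATHENA_SQL = """
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM ccindex.ccindex
    WHERE crawl = 'CC-MAIN-2016-18'
      AND subset = 'warc'
      AND url LIKE '%Locals-tell-of-terror-shock%'
    """

    def fetch_warc_record(filename, offset, length):
        """Read one gzipped WARC record from the public commoncrawl S3 bucket."""
        s3 = boto3.client("s3")
        byte_range = "bytes={}-{}".format(offset, offset + length - 1)
        obj = s3.get_object(Bucket="commoncrawl", Key=filename, Range=byte_range)
        # Each record is an independent gzip member, so it can be decompressed
        # on its own; the result is the WARC headers plus the archived response.
        with gzip.GzipFile(fileobj=io.BytesIO(obj["Body"].read())) as f:
            return f.read()

    # Hypothetical values for illustration; take the real ones from the query result:
    # record = fetch_warc_record("crawl-data/CC-MAIN-2016-18/...warc.gz", 12345, 67890)
    # print(record[:500].decode("utf-8", "replace"))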