amazon-web-services amazon-s3 amazon-ec2 common-crawl

Access a Common Crawl AWS public dataset


I need to browse and download a subset of Common Crawl's public data set. This page mentions where the data is hosted.

How can I browse, and possibly download, the Common Crawl data hosted at s3://aws-publicdatasets/common-crawl/crawl-002/?


Solution

  • As an update: downloading the Common Crawl corpus has always been free, and you can use HTTP instead of S3. If you do use S3, it lets you access the data with anonymous credentials.

    If you want to download via HTTP, get one of the file locations, such as:

    common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz

    and then prepend https://commoncrawl.s3.amazonaws.com/ to it, resulting in the link:

    https://commoncrawl.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz

    This link will work and lets you download the data over HTTP without going through S3.

    To get a listing of all such files, refer to warc.paths.gz (or the equivalent for WET or WAT files) on the more recent crawls, or list the files with anonymous credentials using s3cmd or a similar tool.
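The HTTP approach described in the answer can be sketched in Python roughly as follows. This is only an illustration, not an official client: the helper names (`to_http_url`, `urls_from_paths_listing`, `download`) are hypothetical, the prefix is the one quoted in the answer and may change over time, and the listing is assumed to be a gzip-compressed text file with one file location per line.

```python
import gzip
import io
import shutil
import urllib.request

# Prefix quoted in the answer; kept configurable in case the hosting changes.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def to_http_url(file_location, base_url=BASE_URL):
    """Join the HTTP prefix and a file location into a download URL."""
    return base_url.rstrip("/") + "/" + file_location.lstrip("/")


def urls_from_paths_listing(gz_bytes, base_url=BASE_URL):
    """Expand a gzip-compressed listing into download URLs.

    Assumes one file location per line; blank lines are skipped.
    """
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        return [to_http_url(line.strip(), base_url) for line in lines if line.strip()]


def download(url, destination):
    """Stream a URL to a local file without holding it all in memory."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
```

To use it, pass a file location such as the one shown in the answer to `to_http_url` and hand the result to `download` together with a local file name. If you have already fetched a paths listing, pass its raw bytes to `urls_from_paths_listing` to get one URL per file. The archive files are large, so it is worth trying a single file before looping over a whole listing.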