
Java API to query CommonCrawl to populate Digital Object Identifier (DOI) Database


I am attempting to create a database of Digital Object Identifiers (DOIs) found on the internet.

By manually searching the CommonCrawl Index Server I have obtained some promising results.

However, I wish to develop a programmatic solution.

Ideally, my process would only need to read the index files and not the underlying WARC data files.

The manual steps I wish to automate are these:

1). For each currently available CommonCrawl index collection:

2). I search using "Search a url in this collection: (Wildcards -- Prefix: http://example.com/* Domain: *.example.com)", e.g. link.springer.com/*

3). This returns almost 6 MB of JSON data containing approximately 22K unique DOIs.
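The manual steps above could be automated roughly as follows. This is only a minimal sketch, not a tested solution: the class and method names are hypothetical, the query URL assumes the index server's `<collection>-index?url=...&output=json` form, and the DOI pattern is an assumed approximation ("10.", a 4 to 9 digit registrant code, "/", then a suffix). It only builds the query URL and extracts DOIs from response lines that have already been fetched; the HTTP request itself is left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; none of these names come from CommonCrawl.
final class DoiIndexSketch {

    // Assumed DOI shape: "10.", 4-9 digits, "/", then a run of suffix characters.
    // Percent-encoded DOIs (e.g. "10.1007%2F...") will NOT match; decode first.
    // Path segments after the DOI (e.g. "/fulltext.html") would be swallowed too.
    private static final Pattern DOI =
            Pattern.compile("10\\.\\d{4,9}/[-._;()/:A-Za-z0-9]+");

    /** Builds the index query URL for one collection, e.g. "CC-MAIN-2016-26". */
    static String queryUrl(String collection, String urlPattern) {
        String encoded = URLEncoder.encode(urlPattern, StandardCharsets.UTF_8);
        return "https://index.commoncrawl.org/" + collection
                + "-index?url=" + encoded + "&output=json";
    }

    /** Collects the distinct DOI-like strings found in the given response lines. */
    static Set<String> extractDois(Iterable<String> lines) {
        Set<String> dois = new LinkedHashSet<>();
        for (String line : lines) {
            Matcher m = DOI.matcher(line);
            while (m.find()) {
                dois.add(m.group());
            }
        }
        return dois;
    }
}
```

Calling `queryUrl` once per collection name and passing each response line to `extractDois` would mirror steps 1 to 3. A set is used because the same DOI is likely to appear in many captures.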

How can I browse all available CommonCrawl indexes instead of searching for specific URLs?

From reading the API documentation for CommonCrawl I cannot see how I can browse all the indexes to extract all DOIs for all domains.

UPDATE

I found this example Java code, which shows how to access a Common Crawl dataset:

https://github.com/Smerity/cc-warc-examples/blob/master/src/org/commoncrawl/examples/S3ReaderTest.java

However, when I run it I receive this exception:

"main" org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 404, ResponseStatus: Not Found, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>common-crawl/crawl-data/CC-MAIN-2016-26/segments/1466783399106.96/warc/CC-MAIN-20160624154959-00160-ip-10-164-35-72.ec2.internal.warc.gz</Key><RequestId>1FEFC14E80D871DE</RequestId><HostId>yfmhUAwkdNeGpYPWZHakSyb5rdtrlSMjuT5tVW/Pfu440jvufLuuTBPC25vIPDr4Cd5x4ruSCHQ=</HostId></Error>

In fact every file I try to read results in the same error. Why is that?

What are the correct Common Crawl URIs for their datasets?


Solution

  • To get the example code to work, replace lines 24 and 25 with:

    String fn = "crawl-data/CC-MAIN-2013-48/segments/1386163035819/warc/CC-MAIN-20131204131715-00000-ip-10-33-133-15.ec2.internal.warc.gz";
    S3Object f = s3s.getObject("commoncrawl", fn, null, null, null, null, null, null);
    

    Also note that the Common Crawl group has an updated example.
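Comparing the key in the error message with the working call above suggests what changed: the failing key begins with `common-crawl/`, while the working call uses the bucket `commoncrawl` and a key that starts at `crawl-data/`. The sketch below captures that mapping under this assumption. The class and method names are hypothetical, and it does not guarantee that any particular older file still exists in the bucket.

```java
// Hypothetical helper; the bucket name and key layout are taken from the answer above.
final class CommonCrawlKeySketch {

    // Bucket used in the working getObject call.
    static final String BUCKET = "commoncrawl";

    // Prefix seen in the key of the NoSuchKey error message.
    private static final String OLD_PREFIX = "common-crawl/";

    /** Strips the old "common-crawl/" prefix if present; otherwise returns the key unchanged. */
    static String toCurrentKey(String key) {
        return key.startsWith(OLD_PREFIX) ? key.substring(OLD_PREFIX.length()) : key;
    }
}
```

With this in place, the call from the answer would become `s3s.getObject(CommonCrawlKeySketch.BUCKET, CommonCrawlKeySketch.toCurrentKey(fn), null, null, null, null, null, null)`, so that keys copied from older examples are rewritten before the request is made.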