
Metadata Harvesting


I'm trying to use the metadata harvesting package https://pypi.python.org/pypi/pyoai to harvest the data on this site https://www.duo.uio.no/oai/request?verb=Identify

I tried the example from the pyoai site, but it did not work. When I run it I get an error. The code is:

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

URL = 'http://uni.edu/ir/oaipmh'
registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)

for record in client.listRecords(metadataPrefix='oai_dc'):
    print record

This is the stack trace:

Traceback (most recent call last):
  File "/Users/arashsaidi/PycharmProjects/get-new-DUO/get-files.py", line 8, in <module>
    for record in client.listRecords(metadataPrefix='oai_dc'):
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/common.py", line 115, in method
    return obj(self, **kw)
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/common.py", line 110, in __call__
    return bound_self.handleVerb(self._verb, kw)
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/client.py", line 65, in handleVerb
    kw, self.makeRequestErrorHandling(verb=verb, **kw))    
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/client.py", line 273, in makeRequestErrorHandling
    raise error.XMLSyntaxError(kw)
oaipmh.error.XMLSyntaxError: {'verb': 'ListRecords', 'metadataPrefix': 'oai_dc'}

I need to get access to all the files on the page I have linked to above plus generate an additional file with some metadata.

Any suggestions?


Solution

  • I ended up using the Sickle package, which I found to be better documented and easier to use:

    This code gets all the sets and then retrieves the records from each set. This seems like the best approach given that there are more than 30,000 records to deal with; going set by set gives more control. I hope this helps others out there. I have no idea why libraries use OAI; it does not seem like a good way to organize data to me...

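    from sickle import Sickle

    # Excerpt from a method of my class: self.spec_list and
    # self.write_file_and_metadata() are defined elsewhere (not shown).
    sickle = Sickle('http://www.duo.uio.no/oai/request')  # OAI-PMH client
    sets = sickle.ListSets()  # iterator over all sets
    for recs in sets:
        for rec in recs:  # each item is a (field name, list of values) pair
            if rec[0] == 'setSpec':
                try:
                    print('{} {}'.format(rec[1][0], self.spec_list[rec[1][0]]))
                    records = sickle.ListRecords(metadataPrefix='xoai', set=rec[1][0], ignore_deleted=True)
                    self.write_file_and_metadata()
                except Exception as e:
                    # simple exception handling if it is not possible to retrieve the records
                    print('Exception: {}'.format(e))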
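
For anyone who wants the same set-by-set approach without the surrounding class, here is a minimal standalone sketch. It is not the code from the answer above: harvest_by_set, write_metadata and the output file name are hypothetical names made up for illustration, and it assumes that Sickle set objects expose a setSpec attribute and that records expose header.identifier and metadata, which may differ between Sickle versions.

```python
import json


def harvest_by_set(client, metadata_prefix="oai_dc"):
    """Yield (set_spec, identifier, metadata) for each record, one set at a time.

    `client` is assumed to expose Sickle-style ListSets() and ListRecords().
    """
    for oai_set in client.ListSets():
        set_spec = oai_set.setSpec
        try:
            records = client.ListRecords(
                metadataPrefix=metadata_prefix,
                set=set_spec,
                ignore_deleted=True,
            )
            for record in records:
                yield set_spec, record.header.identifier, record.metadata
        except Exception as exc:
            # A failing set (for example one without matching records)
            # is reported and skipped so the other sets are still harvested.
            print("Skipping set {}: {}".format(set_spec, exc))


def write_metadata(rows, path):
    """Write one JSON object per line and return the number of rows written."""
    count = 0
    with open(path, "w") as handle:
        for set_spec, identifier, metadata in rows:
            row = {"set": set_spec, "id": identifier, "metadata": metadata}
            handle.write(json.dumps(row) + "\n")
            count += 1
    return count


# Possible usage against the repository from the question
# (needs network access and `pip install sickle`; file name is made up):
#   from sickle import Sickle
#   rows = harvest_by_set(Sickle("http://www.duo.uio.no/oai/request"), "xoai")
#   write_metadata(rows, "duo_metadata.jsonl")
```

Because harvest_by_set is a generator, records are written as they arrive instead of holding all 30,000+ in memory, and a failure in one set does not stop the rest of the harvest.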