google-search-appliance dspace

Using a connector to crawl content using a sitemap.xml


We have a DSpace repository of research publications that the GSA is indexing via a web crawl, i.e. it starts at the homepage and follows all the links.

I'm thinking that using a connector to submit URLs for indexing from a sitemap.xml file might be more efficient. The GSA would then only need to index and recrawl the URLs in the sitemap and could ignore the rest of the site.

The suggestion from the GSA documentation is that this is not really a target for a connector, as the content can all be discovered by a web crawl.

What do you think?

Thanks, Georgina.


Solution

  • This might be outdated (so I'm not sure it still works), but there's an example of a Python connector that parses a sitemap.xml and sends it to the GSA as a content feed or a metadata feed. Here are two links to help you:

    https://github.com/google/gsa-admin-toolkit/blob/master/connectormanager/sitemap_connector.py

    https://github.com/google/gsa-admin-toolkit/wiki/ConnectorManagerDocumentation

    If nothing else, this will give you an idea of the logic to implement if you write your own Connector 3.x or Adaptor 4.x.
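To make that logic a bit more concrete, here is a minimal sketch in Python of the two steps involved: pulling the URLs out of a sitemap and wrapping them in a feed document. It is not the code from the linked script. The function names and the `dspace_sitemap` datasource name are made up for illustration, and the feed element names (`gsafeed`, `header`, `datasource`, `feedtype`, `group`, `record`) are written from memory of the GSA feeds protocol, so check them against the feeds documentation for your GSA version. It also assumes a plain `urlset` sitemap, not a sitemap index that points to other sitemap files.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return the <loc> value of every <url> entry in a urlset sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def build_feed(urls, datasource="dspace_sitemap"):
    """Build a metadata-and-url style feed body listing the given URLs.

    Assumption: element and attribute names follow the GSA feeds protocol
    as remembered; the XML declaration and DOCTYPE that the protocol
    describes are left out here and would need to be prepended.
    """
    gsafeed = ET.Element("gsafeed")
    header = ET.SubElement(gsafeed, "header")
    ET.SubElement(header, "datasource").text = datasource  # hypothetical name
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(gsafeed, "group")
    for url in urls:
        # One record per URL; the GSA is expected to fetch the content itself.
        ET.SubElement(group, "record", url=url, mimetype="text/html")
    return ET.tostring(gsafeed, encoding="unicode")
```

Sending the resulting XML to the appliance is left out on purpose, since the feed endpoint and its form fields depend on your setup. With this kind of feed the GSA still fetches each page itself; the feed only tells it which URLs to fetch, so it is worth checking how that interacts with your existing crawl patterns.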