ibm-cloud ibm-watson watson-discovery watson-assistant

IBM Watson Discovery crawling issue


We want to index our client's website and store all of its data in the IBM Watson Discovery service. We will connect Discovery to Watson Assistant so that when a user asks a question related to the client's data, the chatbot queries Discovery and uses the results to respond.

Problem: The client's website has many links, and each linked page contains further links. We want to crawl all the data on the website, index it, and store it in the Watson Discovery service. We tried crawling the site, but Discovery is taking a very long time and has still not finished after one week. How can we achieve this in a better and faster way?


Solution

  • Note that web crawl is currently in beta, and the Watson Discovery documentation for web crawl states that, depending on the website, it may not ingest all data.

    I used the web crawl in Discovery in a scenario similar to yours, and I query my website through a chat built with Watson Assistant. What you should do:

    • increase the number of hops, which controls how deep Watson Discovery crawls into your website
    • depending on your website: add multiple entry points
    • specify all the paths that you want to exclude. I excluded those that would add duplicate entries, as well as generated summary pages, RSS feeds, etc.
    • adjust how often it should crawl
    • check that Watson Discovery can access your website and that your website does not block crawling
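
For the last point, one quick sanity check is to test your site's robots.txt rules locally before starting a long crawl. The following is only a minimal sketch using Python's standard `urllib.robotparser`; it does not call any Watson Discovery API. The helper name `is_crawl_allowed`, the sample rules, the URLs, and the user-agent string `ExampleCrawler` are all assumptions made for illustration, so substitute your real robots.txt content and the user agent your crawler actually sends.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper: it only evaluates robots.txt rules and knows
    nothing about firewalls, logins, or other ways a site can block a crawler.
    """
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines,
    # so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules (an assumption, not taken from any real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
Allow: /
"""

if __name__ == "__main__":
    for path in ("/docs/index.html", "/private/report.html"):
        url = "https://example.com" + path
        allowed = is_crawl_allowed(SAMPLE_ROBOTS_TXT, "ExampleCrawler", url)
        print(url, "allowed" if allowed else "blocked")
```

If a page you expect to be indexed shows up as blocked, the crawler will skip it no matter how many hops or entry points you configure, so it is worth fixing the rules (or the exclusion list) first.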