google-search-appliance

How to re-crawl documents that have an error status


We had an issue yesterday that prevented the GSA crawler from logging in to our website to crawl. Because of this, many of our URLs were indexed as the login page: I see a lot of results on the search page titled "Please log in" (the title of the login page). Also, when I check Index Diagnostics, the crawl status for these URLs is "Retrying URL: Connection reset by peer during fetch.".

The login problem is now resolved, and once a page is re-crawled its crawl status changes to successful, the page content is picked up, and the search results show the proper title. But since I cannot control what is being crawled, there are pages that still haven't been re-crawled and still have the problem.

There is no uniform URL pattern I can use to force a re-crawl. Hence my question: is there a way to force a re-crawl based on the crawl status ("Retrying URL: Connection reset by peer during fetch.")? If that is too specific, how about a re-crawl based on the crawl status type (Errors/Successful/Excluded)?


Solution

    1. Export all the error URLs as a CSV file using "Index > Diagnostics > Index Diagnostics".

    2. Open the CSV, apply a filter on the crawl status column, and collect the URLs that have the error you are looking for (see the script sketch after this list if you want to automate this step).

    3. Copy those URLs, go to "Content Sources > Web Crawl > Freshness Tuning > Recrawl these URL Patterns", paste them in, and click Recrawl.

    That's it. You are done!

    PS: If there are many error URLs (more than 10,000, if I am not wrong), you may not be able to get all of them into a single CSV file. In that case you can do it in batches.
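    If the export is large, filtering by hand in a spreadsheet gets tedious. Below is a minimal Python sketch that combines step 2 and the batching from the PS: it reads the exported CSV, keeps only the rows whose crawl status matches the error, and writes the URLs out in paste-ready batches. The file name and the column headers ("URL" and "Crawl Status") are assumptions, so check them against your actual export before running.

        import csv

        EXPORT_FILE = "index_diagnostics_export.csv"  # assumed file name
        TARGET_STATUS = "Retrying URL: Connection reset by peer during fetch."
        BATCH_SIZE = 10000  # keep each batch a manageable size for pasting

        def main():
            matching = []
            with open(EXPORT_FILE, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    # "URL" and "Crawl Status" are assumed header names;
                    # check the first line of your export and adjust.
                    if row.get("Crawl Status", "").strip() == TARGET_STATUS:
                        matching.append(row["URL"])

            # One URL per line, one file per batch, ready to paste into
            # "Recrawl these URL Patterns".
            for i in range(0, len(matching), BATCH_SIZE):
                batch = matching[i:i + BATCH_SIZE]
                name = f"recrawl_batch_{i // BATCH_SIZE + 1}.txt"
                with open(name, "w", encoding="utf-8") as out:
                    out.write("\n".join(batch) + "\n")
                print(f"{name}: {len(batch)} URLs")

        if __name__ == "__main__":
            main()

    Each output file can then be pasted into the Freshness Tuning page one batch at a time.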

    Regards,

    Mohan