java web-crawler heritrix

Heritrix single-site scrape, including required off-site assets


I believe I need help writing Heritrix decide rules, although I'm open to other Heritrix suggestions: https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Crawl+Scope+Using+DecideRules

I need to scrape a complete copy of a website (the one in the crawler-beans.cxml seed list) without scraping any external (off-site) pages. Any external resources needed to render the site's pages should still be downloaded, but links to off-site pages should not be followed: only the assets for the current page/domain.

For example, CDN content required to render a page might be hosted on an external domain (maybe AWS or Cloudflare). I would need to download that content and follow all on-domain links, but not follow any links to pages outside the current domain.


Solution

  • You could use three decide rules:

    • The first accepts all non-HTML resources, using a ContentTypeNotMatchesRegexDecideRule.
    • The second accepts all URLs in the current domain.
    • The third rejects every URL that is neither in the domain nor reached directly from a page in the domain (the alsoCheckVia option).

    So something like this:

    <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
     <property name="rules">
      <list>
       <!-- Begin by REJECTing all... -->
       <bean class="org.archive.modules.deciderules.RejectDecideRule" />
       <!-- ...then ACCEPT anything whose content type is not HTML/WML (page assets)... -->
       <bean class="org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule">
        <property name="decision" value="ACCEPT"/>
        <property name="regex" value="(?i)html|wml"/>
       </bean>
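       <!-- ...and ACCEPT every URI under the site's own domain (replace the placeholder SURT prefix)... -->
       <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">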
        <property name="decision" value="ACCEPT"/>
        <property name="surtsSource">
         <bean class="org.archive.spring.ConfigString">
          <property name="value">
           <value>
            http://(org,yoursite,
           </value>
          </property> 
         </bean>
        </property>
       </bean>
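       <!-- ...but REJECT URIs that are off-domain and were not reached directly from an on-domain page. -->
       <bean class="org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule">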
        <property name="decision" value="REJECT"/>
        <property name="alsoCheckVia" value="true"/>
        <property name="surtsSource">
         <bean class="org.archive.spring.ConfigString">
          <property name="value">
           <value>
            http://(org,yoursite,
           </value>
          </property> 
         </bean>
        </property>
       </bean>
      </list>
     </property>
    </bean>
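
The `http://(org,yoursite,` value in both rules is a SURT prefix: the host labels in reverse order, left open at the end so that subdomains match as well. As a rough illustration of how such a prefix relates to a site URL, here is a minimal Java sketch. `SurtPrefixSketch` and `domainSurtPrefix` are hypothetical names for this example and are not part of Heritrix. The sketch assumes the `http://` scheme form used in the config above and simply strips a leading `www.`.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Illustrative helper only (hypothetical, not a Heritrix class). */
final class SurtPrefixSketch {

    /**
     * Builds a domain-wide SURT prefix such as "http://(org,yoursite,"
     * from a site URL by reversing the host labels.
     * Assumption: a leading "www." is dropped so the prefix covers the whole domain.
     */
    static String domainSurtPrefix(String siteUrl) {
        String host = URI.create(siteUrl).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + siteUrl);
        }
        host = host.toLowerCase(Locale.ROOT);
        if (host.startsWith("www.")) {
            host = host.substring("www.".length());
        }
        // Reverse the labels: yoursite.org becomes org,yoursite
        List<String> labels = new ArrayList<>(Arrays.asList(host.split("\\.")));
        Collections.reverse(labels);
        // Leave the prefix open (trailing comma, no closing parenthesis)
        return "http://(" + String.join(",", labels) + ",";
    }

    public static void main(String[] args) {
        // prints http://(org,yoursite,
        System.out.println(domainSurtPrefix("http://www.yoursite.org/"));
    }
}
```

The resulting string is what would go inside the `<value>` element of each `surtsSource` bean, with your own site's host in place of `yoursite.org`.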