I believe need help compiling Heritrix decide rules, although I'm open to other Heritrix suggestions: https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Crawl+Scope+Using+DecideRules
I need to scrape an entire copy of a website (in the crawler-beans.cxml seed list), but not scrape any external (off-site) pages. Any external resources needed to render the current website should be downloaded, however not following any links to off-site pages - only the assets for the current page/domain.
For example, CDN content required for the rendering of a page might be hosted on an external domain (maybe AWS or Cloudflare), so I would need to download that content, as well as following all on-domain links, however not follow any links to pages outside of the scope of the current domain.
You could use 3 decide rules:
ContentTypeNotMatchesRegexDecideRule
; So something like that:
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="rules">
<list>
<!-- Begin by REJECTing all... -->
<bean class="org.archive.modules.deciderules.RejectDecideRule" />
<bean class="org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule">
<property name="decision" value="ACCEPT"/>
<property name="regex" value="(?i)html|wml"/>
</bean>
<bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
<property name="decision" value="ACCEPT"/>
<property name="surtsSource">
<bean class="org.archive.spring.ConfigString">
<property name="value">
<value>
http://(org,yoursite,
</value>
</property>
</bean>
</property>
</bean>
<bean class="org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule">
<property name="decision" value="REJECT"/>
<property name="alsoCheckVia" value="true"/>
<property name="surtsSource">
<bean class="org.archive.spring.ConfigString">
<property name="value">
<value>
http://(org,yoursite,
</value>
</property>
</bean>
</property>
</bean>
</list>
</property>
</bean>