elasticsearch stormcrawler

stormcrawler: indexer.md.mapping - what happens if the metadata tag does not exist?


We have been having a weird issue with StormCrawler 1.13. On some (but not all) of our sites, we have a <meta name="college" content="thiscollege"/> tag, and SC has indexer.md.mapping set to parse.college=college. This seems to work correctly for the sites that have that meta tag set.

The problem we are running into is this: if the meta tag is set to thiscollege1 for pages 3.html, 4.html, and 5.html, and the crawler then hits page25.html, which does not have the meta tag, it appears to reuse the value thiscollege1 from 5.html and just stuff it into the college field in the Elastic index.

Is there a way to configure it so that the variable is zeroed out or unset every time the crawler heads to a new page, so that the value is not carried over?

Any advice on how to tweak this setting would be most appreciated!

It's been a bugger of a problem to chase down, as some records just seem to have random entries in them. It wasn't till I matched up the records with some of the status records, sorted by NextFetchDate, that I saw that it could be a carried over variable. I am going to try to set up a specific test with just a couple pages to specifically prove/disprove the theory, but right now it's the only thing that fits what is happening.

Any ideas welcome!


Solution

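  • This should happen only if you have listed parse.college in the values for the config metadata.transfer. The keys listed there are copied from a page's metadata to the outlinks discovered on that page, so page25.html would inherit the value from whichever page linked to it rather than from the page fetched just before it. Removing parse.college from metadata.transfer should stop the value from being carried over.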
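
As a rough illustration of the point above, here is a minimal sketch of the relevant part of a crawler-conf.yaml. It assumes the usual StormCrawler layout with a top-level config block; the key someOtherKey is a hypothetical placeholder, and your own file will contain other settings not shown here.

```yaml
config:
  # Keys listed under metadata.transfer are passed from a page to its outlinks.
  # parse.college is deliberately NOT listed, so a page without the meta tag
  # should not inherit the value from the page that linked to it.
  metadata.transfer:
    - someOtherKey   # hypothetical placeholder, not from the original question

  # Mapping from the question: index the parsed metadata key parse.college
  # into the "college" field of the Elasticsearch document.
  indexer.md.mapping:
    - parse.college=college
```

If parse.college does appear under metadata.transfer in your configuration, removing it and recrawling the affected pages is the first thing to try.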