web-crawler, stormcrawler

Exclude special characters from crawling


I am working with StormCrawler 1.13 and Elasticsearch 6.5.2. How can I prevent the crawler from crawling/indexing special characters such as � and •?


Solution

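  • An easy way to do this is to write a ParseFilter whose filter method does something like:

            // inside ParseFilter.filter(String URL, byte[] content, DocumentFragment doc, ParseResult parse)
            ParseData pd = parse.get(URL);
            String text = pd.getText();
            // remove the unwanted characters, e.g. U+FFFD and the bullet U+2022
            if (text != null) text = text.replaceAll("[\\uFFFD\\u2022]", " ");
            pd.setText(text);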
    

    This will get called on documents parsed by JSoup or Tika. Have a look at the parse filters in the repository for examples.
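
    The "remove chars" step can be kept in a small helper that has no StormCrawler dependency, so it can be tested on its own. The following is only a sketch: the class name SpecialCharCleaner and the method clean are hypothetical, and the set of characters (U+FFFD and U+2022) is an assumption based on the characters shown in the question. Adjust the pattern to whatever you actually see in your index.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of StormCrawler.
public class SpecialCharCleaner {

    // Assumed set of unwanted characters: the Unicode replacement
    // character (U+FFFD) and the bullet (U+2022).
    private static final Pattern UNWANTED = Pattern.compile("[\\uFFFD\\u2022]");

    // Runs of two or more plain spaces left behind by the replacement.
    private static final Pattern SPACE_RUNS = Pattern.compile(" {2,}");

    // Replaces each unwanted character with a space (so neighbouring words
    // are not fused), collapses runs of spaces, and trims the result.
    public static String clean(String text) {
        if (text == null) {
            return null;
        }
        String replaced = UNWANTED.matcher(text).replaceAll(" ");
        return SPACE_RUNS.matcher(replaced).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("Price \uFFFD 10 \u2022 special offer"));
    }
}
```

    Inside the ParseFilter you would then call pd.setText(SpecialCharCleaner.clean(pd.getText())). Replacing with a space rather than an empty string is a design choice: it avoids gluing together words that were only separated by a bullet.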