I'm working with StormCrawler 1.13 and Elasticsearch 6.5.2. How can I stop the crawler from indexing special characters such as � and •?
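An easy way to do this is to write a ParseFilter whose filter method does something like:
ParseData pd = parse.get(URL);
String text = pd.getText();
if (text != null) {
    // remove the unwanted characters, here U+FFFD (replacement character) and U+2022 (bullet)
    pd.setText(text.replaceAll("[\\uFFFD\\u2022]", ""));
}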
The filter gets called on documents parsed by either JSoup or Tika. Have a look at the parse filters in the StormCrawler repository for examples.
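The character removal itself can be kept in a small helper that does not depend on StormCrawler, which makes it easy to test on its own. The following is only a sketch: the class name `TextCleaner`, the method `clean`, and the choice of characters to strip (U+FFFD and U+2022) are assumptions for illustration, so adjust the character class to whatever you actually see in your index.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of StormCrawler.
final class TextCleaner {

    // Assumed set of unwanted characters:
    // U+FFFD (replacement character) and U+2022 (bullet).
    private static final Pattern UNWANTED = Pattern.compile("[\\uFFFD\\u2022]");

    private TextCleaner() {
        // static utility, no instances
    }

    // Returns the text with every unwanted character removed.
    // A null input is returned unchanged.
    static String clean(String text) {
        if (text == null) {
            return null;
        }
        return UNWANTED.matcher(text).replaceAll("");
    }
}
```

Inside the parse filter you would then call something like `pd.setText(TextCleaner.clean(pd.getText()));`. Compiling the pattern once avoids recompiling the regular expression for every document.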