web-crawler crawler4j categorization

Crawler4j downloading articles


I'm trying to download articles from news portals using Crawler4j. I would like to store them in folders named after categories such as 'sport', 'science', 'health', or whatever other categories the portal defines. URL parsing isn't enough, since some portals don't put the category in the URL. The only idea I have is to build a tree and remember the links found on the current page. Is there an easier way to do it?


Solution

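  • You can parse the downloaded pages themselves and use CSS selectors to identify the title or the breadcrumb, which typically names the section the article belongs to.

    I would suggest using jsoup for that.

    You will need to know each news site and which CSS selector matches its breadcrumb, since the markup differs from site to site.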
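
    As a rough sketch of the storage step, assuming the category text has already been read from the page's breadcrumb element (for example with jsoup, using a selector you have determined for that site), the article can be filed under a folder derived from that text. The class name `CategoryStore`, the helper names, the `uncategorized` fallback, and the slug rules are all hypothetical choices for illustration, not part of Crawler4j or jsoup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

public class CategoryStore {

    // Turns free-form breadcrumb text such as "Science & Tech" into a safe folder name.
    // Assumption: ASCII-only slugs are acceptable; other characters are replaced by dashes.
    static String toFolderName(String category) {
        if (category == null) {
            return "uncategorized";
        }
        String slug = category.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")  // collapse runs of other characters into one dash
                .replaceAll("^-+|-+$", "");     // drop leading and trailing dashes
        return slug.isEmpty() ? "uncategorized" : slug;
    }

    // Writes the article text to <root>/<category folder>/<fileName> and returns the path.
    static Path save(Path root, String category, String fileName, String text) throws IOException {
        Path dir = root.resolve(toFolderName(category));
        Files.createDirectories(dir);  // no-op if the folder already exists
        Path target = dir.resolve(fileName);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for the real download folder.
        Path root = Files.createTempDirectory("articles");
        Path saved = save(root, "Science & Tech", "article-1.txt", "Example article body");
        System.out.println(saved);
    }
}
```

    In a Crawler4j setup, something like `save(...)` would presumably be called from the crawler's page-visit callback once the breadcrumb has been extracted; pages where no breadcrumb is found end up in the fallback folder instead of being dropped.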