
How to crawl magnet links with Apache Nutch and Solr so that they're available in Solr query results?


I have familiarized myself with crawling with Apache Nutch and Solr, but noticed that while HTTP and HTTPS links show up in the content field of Solr query results, magnet links do not. I adjusted conf/regex-urlfilter.txt to be

-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# for linuxtracker.org
+^https?://*linuxtracker.org/(.+)*$
#+^magnet:\?xt=(.+)*$
    # causes magnet links to be ignored/not appear in content field
+^magnet:*$

# reject anything else
-.

and don't see why magnet links wouldn't be included in content. As you can see, I'm testing this against http://linuxtracker.org, which has, for example, the magnet link magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P on http://linuxtracker.org/?page=torrent-details&id=24c76d5e7f3a758f0798e9b5895cc2e9ac9797cf.

After crawling with bin/crawl, there are no magnet links in the results when querying Solr with pysolr as follows:

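import pysolr

# Placeholder (assumption): point this at the Solr core that Nutch indexes into
solr_core_url = 'http://localhost:8983/solr/nutch'
solr = pysolr.Solr(solr_core_url, timeout=10)
results = solr.search('*:*')
for result in results:
    print(result)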

I'm using Apache Nutch release-1.13-73-g9446b1e1 and Solr 6.6.1 on Ubuntu 17.04.


Solution

  • Short answer: magnet links are not "normal" links and are not supported out of the box by Nutch.

    Long answer:

    The configuration that you've changed gets applied after the links are extracted. If you're using parse-html, the parse plugin first checks whether each candidate outlink is a valid link, which basically just means constructing a java.net.URL from it.

    java.net.URL, on the other hand, doesn't support magnet links out of the box. According to the Javadoc:

    Protocol handlers for the following protocols are guaranteed to exist on the search path:

     http, https, ftp, file, and jar
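
The behaviour described above can be checked outside of Nutch. The following is a minimal sketch, assuming a default JVM with no custom URLStreamHandlerFactory installed; the class and method names (MagnetUrlCheck, isAcceptedByJavaNetUrl) are made up for illustration and are not part of Nutch.

```java
import java.net.MalformedURLException;
import java.net.URL;

class MagnetUrlCheck {
    // Returns true if java.net.URL can be constructed from the string,
    // false if the constructor throws MalformedURLException.
    static boolean isAcceptedByJavaNetUrl(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String http = "http://linuxtracker.org/";
        String magnet = "magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P";
        // Print whether each candidate link survives URL construction.
        System.out.println(http + " accepted: " + isAcceptedByJavaNetUrl(http));
        System.out.println(magnet + " accepted: " + isAcceptedByJavaNetUrl(magnet));
    }
}
```

With no handler registered for the magnet scheme, the HTTP URL should be accepted and the magnet URI rejected, which is consistent with such links being dropped before any URL filter sees them.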
    

    If you're using parse-tika, something similar happens.

    One option would be to write a custom parser that handles this for you. Keep in mind that, in any case, you wouldn't want to follow the magnet links (i.e. have them as outlinks), because Nutch would not be able to process them.

    If you only want to have the links indexed in Solr/ES (for search), you could write your own HtmlParseFilter and add those links in a separate field, for instance.
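
As a rough illustration of the extraction step such a filter would need, here is a standalone sketch that pulls magnet URIs out of raw HTML with a regular expression. It is not a Nutch plugin: the class name MagnetLinkExtractor and its extract method are hypothetical, the pattern assumes quoted href attributes and does not decode HTML entities, and wiring the result into an HtmlParseFilter and a dedicated Solr field is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MagnetLinkExtractor {
    // Simplified pattern (assumption): href values are quoted with " or '
    // and the magnet URI itself contains neither quote character.
    private static final Pattern MAGNET_HREF = Pattern.compile(
            "href\\s*=\\s*([\"'])(magnet:\\?[^\"']+)\\1",
            Pattern.CASE_INSENSITIVE);

    // Collects every magnet URI found in href attributes of the given HTML.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = MAGNET_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2)); // group 2 is the URI without the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html =
                "<a href=\"magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P\">Magnet</a>"
              + "<a href=\"http://linuxtracker.org/\">Home</a>";
        System.out.println(extract(html));
    }
}
```

In a real filter the extracted strings would be stored as plain text values for indexing rather than turned into outlinks, since Nutch cannot fetch them.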