parsing nutch apache-tika nutch2

Apache Nutch title parsing issue for language-specific websites


I have configured Apache Nutch 2.3.1 with Hadoop 2.7.5 and HBase 0.98, and I need to crawl some Urdu websites. I am using the default parsers, i.e. html and tika. Some documents have a title in Urdu, which is fine, but in others the title tag is in Urdu while the first-level heading (h1) holds the original title, e.g. bbc-page. Similarly, there are cases where the meta tags contain the relevant title. Is there a built-in option (parser) that handles this, so that h1 is selected as the title when it is available?

Or, if I have to implement this myself, what are the possible ways to do it?


Solution

  • Nutch uses the title tag whenever one is present in the DOM tree (https://github.com/apache/nutch/blob/bb2a7adddbc5c780151bb9957d68af52be7339ca/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L251), so changing that behaviour requires custom logic in a parser plugin. The real question is how you would identify a "bad" title tag: would it contain some specific content (like the URL)?

    In any case, you'll need to write your own plugin, either as a parser plugin or as an indexing plugin (for example, one that takes another field and copies it over to the title field under certain conditions).
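
The selection logic such a plugin would need can be sketched in plain Java on top of the standard org.w3c.dom API. This is only a sketch under stated assumptions, not Nutch code: the class TitleChooser and its methods firstH1Text, chooseTitle and parse are hypothetical names, the rule "prefer the first non-empty h1, otherwise keep the parsed title" is just one possible policy, and the wiring into a Nutch parse or indexing plugin is deliberately left out. The parse helper exists only so the sketch can be tried on a well-formed XHTML string outside Nutch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; none of these names come from Nutch.
class TitleChooser {

    /** Returns the whitespace-normalized text of the first non-empty h1 under root, or null. */
    static String firstH1Text(Node root) {
        // Compare case-insensitively, since some HTML parsers report upper-case element names.
        if (root.getNodeType() == Node.ELEMENT_NODE
                && "h1".equalsIgnoreCase(root.getNodeName())) {
            String text = root.getTextContent();
            if (text != null && !text.trim().isEmpty()) {
                return text.trim().replaceAll("\\s+", " ");
            }
        }
        // Depth-first search in document order.
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = firstH1Text(children.item(i));
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Assumed policy: prefer the h1 text when present, otherwise keep the parsed title. */
    static String chooseTitle(Node root, String parsedTitle) {
        String h1 = firstH1Text(root);
        return h1 != null ? h1 : parsedTitle;
    }

    /** Demo-only helper: parses a well-formed XHTML string into a DOM document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo document", e);
        }
    }
}
```

For a document such as `<html><head><title>Site</title></head><body><h1>Real headline</h1></body></html>`, calling `TitleChooser.chooseTitle(doc, "Site")` picks the h1 text, and a page without any h1 keeps the title that was already parsed. A stricter policy, such as replacing the title only when it matches the URL or a known site-wide string, would go into chooseTitle.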