html, css-selectors, jsoup

JSoup: get Wikipedia page summary


I used the MediaWiki API to fetch a Wikipedia page. After getting the HTML content, I tried the selector

p:not(h2 ~ p)

to get the page summary paragraphs, which should be the paragraphs before the table of contents element. It returns the part I want but also some additional paragraphs. Where is the problem?


Solution

  • p:not(h2 ~ p) matches every paragraph on the page that has no h2 as a preceding sibling, that is, no h2 before it within the same parent. This includes nested paragraphs, paragraphs outside the main content altogether, and so on, because none of those paragraphs share a parent element with an h2. You don't want those; you only want the paragraphs that appear before the first h2 within the h2's own parent element.

    To do that, anchor the outer p selector to that parent element with a child combinator (>). The parent element you want is .mw-parser-output:

    .mw-parser-output > p:not(h2 ~ p)
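
To see the difference in jsoup, here is a minimal sketch. The class name SummaryExtractor, the method summaryParagraphs, and the sample HTML are hypothetical stand-ins for illustration; the real markup returned by the MediaWiki API is more complex and may differ from this simplified structure.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SummaryExtractor {
    // Returns the text of the lead paragraphs: <p> elements that are direct
    // children of .mw-parser-output and have no preceding <h2> sibling.
    static List<String> summaryParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select(".mw-parser-output > p:not(h2 ~ p)").eachText();
    }

    public static void main(String[] args) {
        // Hypothetical, simplified stand-in for the API's HTML (an assumption).
        String html =
              "<div class=\"mw-parser-output\">"
            + "<div class=\"infobox\"><p>Nested infobox paragraph.</p></div>"
            + "<p>First summary paragraph.</p>"
            + "<p>Second summary paragraph.</p>"
            + "<h2>History</h2>"
            + "<p>Section paragraph.</p>"
            + "</div>"
            + "<div class=\"footer\"><p>Footer paragraph.</p></div>";

        // The unanchored selector also picks up the nested and footer paragraphs,
        // since neither has an <h2> sibling before it.
        System.out.println(Jsoup.parse(html).select("p:not(h2 ~ p)").eachText());

        // The anchored selector keeps only the direct children before the first <h2>.
        System.out.println(summaryParagraphs(html));
    }
}
```

The child combinator restricts the match to paragraphs directly inside .mw-parser-output, so the nested and footer paragraphs in the sample drop out while the :not(h2 ~ p) part still excludes everything after the first h2.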