html xpath scrapy html-content-extraction

HTML XPath: Extracting text mixed in with multiple levels of complex tags?


Related earlier questions:

HTML XPath: Extracting text mixed in with multiple tags?

HTML XPath: Selectively avoiding tags when extracting text

(Sorry for my poor English.)

I'm a beginner at writing web crawlers. I'm trying to extract the main content from web pages (in Chinese) with XPath (though I have learned that there are both traditional and machine-learning algorithms for extracting the main content of a web page), and I'm very new to writing XPath rules.

I'm faced with a web page that contains text mixed into complex tags. I summarize it as follows, where letters (e.g. A, A2) stand for text only, and '...' stands for more tags, possibly nested, that contain no text. I want to get "AA2BB2CDEFGHIJKLMNOP".

...
<div id="artibody" class="art_context">
    <div align="center">...</div>
    <div align="center"><font>A</font>A2</div>
    <div align="left"><br><br><strong>B</strong>B2</div>
    <div align="left">
        <p>C<a>D</a>E</p>
        <p>F<a>G</a>H<a>I</a>J</p>K
    </div>
    <div align="center">...</div>
    <div align="center"><font>L</font></div>
    <p>M</p><!-- luckily, M contains only text -->
    <p>N</p>
    <p>O</p>
    <p>P<span>...</span><div class="shareBox">...</div>
    </p>
    <span id="arctTailMark"></span>
    <script>
        var page_navigation = document.getElementById('page_navigation');
        ...
    </script>
    <div style="padding:10px 0 30px 0">...</div>
</div>

Thanks to the previous questions, I wrote this rule:

'string(//div[@class=\"art_context\"])'

With it I get all the content I want as plain text without tags, but the JS code in <script> is extracted as well. I tried the following, but it doesn't help; the JS code is still in the result.

'string(//div[@class=\"art_context\" and not(self::script)])'

The following one returns only "\r\n".

'//div[@class=\"art_context\" and not(self::script)]/text()'

Here are my questions:

1. How do I write an XPath rule that meets my need: extracting the content of div[@id="artibody"] except the code in <script>?

2. Is the rule for question 1 simple and robust? I may meet more pages with a div[@id="artibody"] whose descendant nodes are quite different.

3. Any further suggestions for my task? I'm extracting content from one website, but the main content lies in <div> elements with different id, class, and descendant node structure. I run the spider on my laptop (Intel Core i5 3225, 8 GB RAM), and using machine learning algorithms may slow the crawl significantly. At the same time, writing many XPath rules is tedious.

I'd appreciate any suggestions on this question (and on my English).


Solution

  • To get all descendant text nodes except the script contents, you can use this:

    //div[@class="art_context"]//*[not(self::script)]/text()
    

    In natural language: “For every div[@class="art_context"] element, take all of its descendant elements except script elements, and get the text nodes that are direct children of those descendants”.

    The // after div[@class="art_context"] is needed to select descendants, not just children.

    In comparison, the //div[@class="art_context" and not(self::script)]/text() expression in the question says “Get all text-node children of all div[@class="art_context"] elements that are not also script elements.”

    So the and not(self::script) part of the expression in the question is redundant: a div element is never a script element, so the predicate selects the same nodes as //div[@class="art_context"] alone. The /text() step then selects only the text nodes that are direct children of that div, which are just line breaks.

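    Also, if you want XPath to return the result as a single string rather than a set of text nodes, you can use the functions string-join(…) and normalize-space(…). Note that string-join(…) is an XPath 2.0 function, so it is not available in XPath 1.0 engines such as lxml, which Scrapy's selectors are built on: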

    normalize-space(string-join(//div[@class="art_context"]//*[not(self::script)]/text(), ""))
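
As a rough illustration of the first expression, here is a minimal Python sketch. It assumes the lxml package is installed; PAGE is a trimmed copy of the markup from the question with the '...' placeholders dropped, and extract_text is a hypothetical helper name, not part of any library. Because lxml implements XPath 1.0 only, the joining is done in Python instead of with string-join(…).

```python
from lxml import html

# Trimmed copy of the sample markup from the question ('...' placeholders dropped).
PAGE = """
<div id="artibody" class="art_context">
    <div align="center"><font>A</font>A2</div>
    <div align="left"><br><br><strong>B</strong>B2</div>
    <div align="left">
        <p>C<a>D</a>E</p>
        <p>F<a>G</a>H<a>I</a>J</p>K
    </div>
    <div align="center"><font>L</font></div>
    <p>M</p>
    <p>N</p>
    <p>O</p>
    <p>P<span></span></p>
    <span id="arctTailMark"></span>
    <script>
        var page_navigation = document.getElementById('page_navigation');
    </script>
</div>
"""

# Text-node children of every descendant element that is not a <script>.
TEXT_XPATH = '//div[@class="art_context"]//*[not(self::script)]/text()'


def extract_text(page_source):
    """Hypothetical helper: article text as one string, script code excluded."""
    tree = html.document_fromstring(page_source)
    nodes = tree.xpath(TEXT_XPATH)  # text nodes, in document order
    # Strip each node so whitespace-only nodes (indentation, line breaks) vanish.
    # Note: for space-separated languages you may prefer " ".join(...) instead.
    return "".join(node.strip() for node in nodes)


if __name__ == "__main__":
    print(extract_text(PAGE))
```

In a Scrapy callback the same expression could be passed to response.xpath(...).getall() and the resulting list joined in the same way.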