google-sheets, web-scraping, google-sheets-formula

Extract all <li> with the class="myTag" but without <small>


I am struggling with the IMPORTXML function in Google Sheets. A specific webpage has this content:

<li class="myTag">
    <a href="/xyz/abc/">
        Jouets
        <small class="hidden">
            <span>235&nbsp;6453&nbsp;sous-section</span>
            <span>1231&nbsp;765&nbsp;produits</span>
            <span>12155&nbsp;654&nbsp;images</span>
        </small>
    </a>
</li>
<li class="myTag">
    <a href="/xyz/def/">
        Livres
        <small class="hidden">
            <span>235&nbsp;6453&nbsp;sous-section</span>
            <span>1231&nbsp;765&nbsp;produits</span>
            <span>12155&nbsp;654&nbsp;images</span>
        </small>
    </a>
</li>

I want to extract the text of every <li> tag with class="myTag", but without the <small> section, so that I get this list:

  • Jouets
  • Livres

I have written this formula:

=IMPORTXML(B1; '//li[@class='myTag'].ReplaceAll("<[^>]*>", "", "")')

But I get an #N/A error.

How can I extract just the first elements (Jouets, Livres) without the <small> section?


Originality of this question

The question "Scraping data to Google Sheets from a website that uses JavaScript" is about dynamic web pages, which is not my case. My page is already served as static HTML: I can already extract the full content of each <li> tag with the XPath '//li[@class='myTag']'.

My goal is to exclude the <small> content, which ends up glued to the label.
I am looking to get this list:

  • Jouets
  • Livres

and not this one:

  • Jouets235 6453 sous-section1231 765 produits12155 654 images
  • Livres235 6453 sous-section1231 765 produits12155 654 images

Solution

  • From your reply:

    Currently, with this formula I have this result (just pasting the first lines):
    - Arts1 069 sous-thèmes15 778 articles11 949 médias
    - Astronomie et astrophysique107 sous-thèmes570 articles696 médias
    - Chimie147 sous-thèmes1 042 articles1 087 médias
    But I am attempting to get only:
    - Arts
    - Astronomie et astrophysique
    - Chimie

    How about the following XPath? It selects only the text nodes that are direct children of the <a> element, so the text nested inside <small> is left out.

    Modified XPath:

    //li[@class='with-counter']/a/text()
    

    Modified formula:

    =IMPORTXML("https://www.universalis.fr/classification/"; "//li[@class='with-counter']/a/text()")
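To see why selecting the anchor's own text drops the counters, here is a minimal Python sketch run against markup modelled on the question. It is an illustration under assumptions, not what Google Sheets does internally: the sample markup, the wrapping <ul>, and the helper names direct_labels and glued_text are hypothetical. The standard-library ElementTree does not support the text() step, so the sketch reads the element's .text attribute instead, which holds the text that comes before the first child element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the question's markup. &nbsp; is written
# as &#160; because the stdlib XML parser does not know HTML named entities.
SAMPLE = """
<ul>
  <li class="myTag">
    <a href="/xyz/abc/">
      Jouets
      <small class="hidden">
        <span>235&#160;6453&#160;sous-section</span>
      </small>
    </a>
  </li>
  <li class="myTag">
    <a href="/xyz/def/">
      Livres
      <small class="hidden">
        <span>1231&#160;765&#160;produits</span>
      </small>
    </a>
  </li>
</ul>
""".strip()


def direct_labels(markup, css_class):
    """Return the text that sits directly inside each <a>, before <small>."""
    root = ET.fromstring(markup)
    labels = []
    for anchor in root.findall(f".//li[@class='{css_class}']/a"):
        # .text is only the text before the first child element,
        # so nothing nested inside <small> is included.
        labels.append((anchor.text or "").strip())
    return labels


def glued_text(markup, css_class):
    """Return the concatenated text of each whole <li>, counters included."""
    root = ET.fromstring(markup)
    return [
        "".join(part.strip() for part in li.itertext())
        for li in root.findall(f".//li[@class='{css_class}']")
    ]


print(direct_labels(SAMPLE, "myTag"))  # ['Jouets', 'Livres']
print(glued_text(SAMPLE, "myTag"))
```

The second helper reproduces the glued result from the question: taking the text of the whole <li> concatenates every descendant text node, while the anchor's direct text stops before <small>.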
    
