Tags: javascript, jquery, node.js, href, cheerio

jQuery Only Returning First .attr('href');


I'm trying to crawl a webpage using Node.js and cheerio. Everything is returned as I expect except for the hrefs.

I get the expected values for the headers with .find('h3').text() and for the descriptions with .find('a').text(), but for the links, .find('a').attr('href') returns only the first href. This confuses me because the description text comes from the same anchors.

If I remove .attr('href') and return just .find('a'), the link text (href) is displayed as expected. I could post-process that value to make this work, but I would prefer to do it correctly.

Script:

const cheerio = require("cheerio");
const axios = require("axios");

axios.get("http://localhost:8000/sample_page_2.html").then(urlResponse => {
    const $ = cheerio.load(urlResponse.data);

    $('div.tos-post-type').each((i, element) => {

        const header = $(element)
            .find('h3')
            .text()
            .trim();
        console.log('------------------------------------------------------------------------------------');
        console.log('HEADER: ' + header);

        const link = $(element)
            .find('a')
            .attr('href');

        console.log('\nLINK(s): \n' + link);

        const description = $(element)
            .find('a')
            .text();

        console.log('\nDESCRIPTION(s): \n' + description + '\n');
        console.log('------------------------------------------------------------------------------------');
    });
});

Here is a snippet of the page I'm trying to crawl:

<div class="container tos-archive">
    <div class="row justify-content-center">
        <div class="col-lg-10">
            <div class="row">
                <div class="col-lg-6">
                    <div class="tos-post-type" style="background-image: url(https://www.example_domain.com/wp-content/hero-pattern.png)">
                        <div class="icon" style="background-image: url( https://www.example_domain.com/wp-content/legal.svg )"></div>
                        <h3>
                            Legal </h3>
                        <a href="https://www.example_domain.com/legal/terms-conditions/">
                            Terms &amp; Conditions </a>
                        <a href="https://www.example_domain.com/legal/service-providers/">
                            Service Providers </a>
                    </div>
                </div>
                <div class="col-lg-6">
                    <div class="tos-post-type" style="background-image: url(https://www.example_domain.com/wp-content/hero-pattern.png)">
                        <div class="icon" style="background-image: url( https://www.example_domain.com/wp-content/policy.svg )"></div>
                        <h3>
                            Policies </h3>
                        <a target="" href="https://www.example_domain.com/privacy-policy/">
                            Privacy Policy </a>
                        <a target="" href="https://store.example_domain.com/EXHM/store?Action=DisplayEXCookiesPolicyPage">
                            Cookie Policy </a>
                    </div>
                </div>
                <div class="col-lg-6">
                    <div class="tos-post-type" style="background-image: url(https://www.example_domain.com/wp-content/hero-pattern.png)">
                        <div class="icon" style="background-image: url( https://www.example_domain.com/wp-content/clip-dark.svg )"></div>
                        <h3>
                            <a href="https://www.example_domain.com/compliance/">
                                Compliance </a>
                        </h3>
                        <a href="https://www.example_domain.com/compliance/ccpa/">
                            California Consumer Privacy Act (CCPA) </a>
                        <a href="https://www.example_domain.com/compliance/disaster-recovery/">
                            Disaster Recovery </a>
                        <a href="https://www.example_domain.com/compliance/gdpr/">
                            GDPR </a>
                        <a href="https://www.example_domain.com/compliance/pci-dss/">
                            PCI DSS </a>
                        <a href="https://www.example_domain.com/compliance/privacymark/">
                            PrivacyMark </a>
                        <a class="tos-view-all" href="https://www.example_domain.com/compliance/">
                            View All </a>
                    </div>
                </div>
                <div class="col-lg-6">
                    <div class="tos-post-type" style="background-image: url(https://www.example_domain.com/wp-content/hero-pattern.png)">
                        <div class="icon" style="background-image: url( https://www.example_domain.com/wp-content/mouse.svg )"></div>
                        <h3>
                            Other </h3>
                        <a href="https://www.example_domain.com/legal-other/eu-standard-solutions/">
                            EU Standard Solutions </a>
                        <a href="https://www.example_domain.com/legal-other/eu-standard-service-providers/">
                            EU Standard Service Providers </a>
                        <a href="https://www.example_domain.com/legal-other/data-exhibit/">
                            Data Exhibit </a>
                        <a href="https://www.example_domain.com/legal-other/data-standards/">
                            Data Standards </a>
                        <a href="https://www.example_domain.com/legal-other/payment-addenda/">
                            Payment Addenda </a>
                    </div>
                </div>
            </div>
        </div>
    </div>
</div>

Here's a snippet of actual results:

------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
HEADER: Policies

LINK(s):
https://www.example_domain.com/privacy-policy/

DESCRIPTION(s):

                            Privacy Policy
                            Cookie Policy

------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
HEADER: Compliance

LINK(s):
https://www.example_domain.com/compliance/

DESCRIPTION(s):

                                Compliance
                            California Consumer Privacy Act (CCPA)
                            Disaster Recovery
                            GDPR
                            PCI DSS
                            PrivacyMark
                            View All

------------------------------------------------------------------------------------
------------------------------------------------------------------------------------

Here is what I am expecting (multiple links):

------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
HEADER: Policies

LINK(s):
https://www.example_domain.com/privacy-policy/
https://store.example_domain.com/EXHM/store?Action=DisplayEXCookiesPolicyPage

DESCRIPTION(s):

                            Privacy Policy
                            Cookie Policy

------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
HEADER: Compliance

LINK(s):
https://www.example_domain.com/compliance/
https://www.example_domain.com/compliance/ccpa/
https://www.example_domain.com/compliance/disaster-recovery/
https://www.example_domain.com/compliance/gdpr/
https://www.example_domain.com/compliance/pci-dss/
https://www.example_domain.com/compliance/privacymark/
https://www.example_domain.com/compliance/

DESCRIPTION(s):

                                Compliance
                            California Consumer Privacy Act (CCPA)
                            Disaster Recovery
                            GDPR
                            PCI DSS
                            PrivacyMark
                            View All

------------------------------------------------------------------------------------
------------------------------------------------------------------------------------

Any ideas what I'm doing incorrectly?

Thanks!


Solution

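  • .attr('href') reads the attribute from only the first element in the matched set, whereas .text() concatenates the text of every matched element. To get every href, turn the set into a plain array with .get() and map over it: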

    $(element).find('a').get().map(a => $(a).attr('href'))