javascript html web-scraping cheerio

Cheerio get text from non-unique HTML class (JS)


I am trying to scrape information from a website with the following HTML format:

<tr class="odd">
<td class="">    <table class="inline-table">
        <tbody><tr>
            <td rowspan="2">
                <img src="https://img.a.transfermarkt.technology/portrait/medium/881116-1664480529.jpg?lm=1" data-src="https://img.a.transfermarkt.technology/portrait/medium/881116-1664480529.jpg?lm=1" title="Darío Osorio" alt="Darío Osorio" class="bilderrahmen-fixed lazy entered loaded" data-ll-status="loaded">            </td>
            <td class="hauptlink">
                <a title="Darío Osorio" href="/dario-osorio/profil/spieler/881116">Darío Osorio</a>                            </td>
        </tr>
        <tr>
            <td>Right Winger</td>
        </tr>
    </tbody></table>
</td><td class="zentriert">18</td><td class="zentriert"><img src="https://tmssl.akamaized.net/images/flagge/verysmall/33.png?lm=1520611569" title="Chile" alt="Chile" class="flaggenrahmen"></td><td class=""><table class="inline-table">
    <tbody><tr>
        <td rowspan="2">
            <a title="Club Universidad de Chile" href="/club-universidad-de-chile/startseite/verein/1037"><img src="https://tmssl.akamaized.net/images/wappen/tiny/1037.png?lm=1420190110" title="Club Universidad de Chile" alt="Club Universidad de Chile" class="tiny_wappen"></a>       </td>
        <td class="hauptlink">
            <a title="Club Universidad de Chile" href="/club-universidad-de-chile/startseite/verein/1037">U. de Chile</a>       </td>
    </tr>
    <tr>
        <td>
            <img src="https://tmssl.akamaized.net/images/flagge/tiny/33.png?lm=1520611569" title="Chile" alt="Chile" class="flaggenrahmen"> <a title="Primera División" href="/primera-division-de-chile/transfers/wettbewerb/CLPD">Primera División</a>        </td>
    </tr>
</tbody></table>
</td><td class=""><table class="inline-table">
    <tbody><tr>
        <td rowspan="2">
            <a title="Newcastle United" href="/newcastle-united/startseite/verein/762"><img src="https://tmssl.akamaized.net/images/wappen/tiny/762.png?lm=1472921161" title="Newcastle United" alt="Newcastle United" class="tiny_wappen"></a>     </td>
        <td class="hauptlink">
            <a title="Newcastle United" href="/newcastle-united/startseite/verein/762">Newcastle</a>        </td>
    </tr>
    <tr>
        <td>
            <img src="https://tmssl.akamaized.net/images/flagge/verysmall/189.png?lm=1520611569" title="England" alt="England" class="flaggenrahmen"> <a title="Premier League" href="/premier-league/transfers/wettbewerb/GB1">Premier League</a>      </td>
    </tr>
</tbody></table>
</td><td class="rechts">-</td><td class="rechts">€3.00m</td><td class="rechts hauptlink">? </td><td class="zentriert hauptlink"><a title="Darío Osorio to Newcastle United?" id="27730/Newcastle United sent scouts to Chile to follow Dario Osorio. the 18-year-old is being monitored by Barcelona, ​​Wolverhampton and Newcastle United./http://www.90min.com//16127/180/Darío Osorio to Newcastle United?" class="icons_sprite icon-pinnwand-sprechblase sprechblase-wechselwahrscheinlichkeit" href="https://www.transfermarkt.co.uk/dario-osorio-to-newcastle-united-/thread/forum/180/thread_id/16127/post_id/27730#27730">&nbsp;&nbsp;&nbsp;</a></td></tr>

I want to scrape "Darío Osorio", "U. de Chile" and "Newcastle", each of which is the text of a different element with class="hauptlink" in the HTML.

I have tried a couple of different things, my most recent attempt looks like this:

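$('.odd', html).each((index, el) => {
  const source = $(el)
  // Only the first .hauptlink cell in the row is read here
  const information = source.find('td.hauptlink').first().text().trim()
  // .text() on several matched elements returns their text joined together
  const differentInformation = source.find('a:nth-child(1)').text()
})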

But I only manage to scrape "Darío Osorio", using the first() method. With my code, differentInformation currently comes out as "Darío OsorioU. de ChileNewcastle". The result I want in the end is a JSON array of objects like this:

[
  {
    "firstInfo": "Darío Osorio",
    "secondInfo": "U. de Chile",
    "thirdInfo": "Newcastle"
  },
  {
    "firstInfo": "Information",
    "secondInfo": "Different Information",
    "thirdInfo": "More Different Information"
  }
]

Solution

  • After clarification in the comments, it sounds like you're looking for something like this:

    const cheerio = require("cheerio"); // 1.0.0-rc.12
    
    const url = "YOUR URL";
    
    (async () => {
      const response = await fetch(url);
    
      if (!response.ok) {
        throw Error(response.statusText);
      }
    
      const html = await response.text();
      const $ = cheerio.load(html);
    
      const data = [...$(".items .odd, .items .even")].map(e => {
        const [player, currentClub, interestedClub] =
          [...$(e).find(".hauptlink")].map(e => $(e).text().trim());
        return {player, currentClub, interestedClub};
      });
      console.log(data);
    })()
      .catch(error => console.log(error));
    

    This relies on .hauptlink, which marks the three cells you're interested in. Each row also has later cells carrying .hauptlink (the "?" cell and the forum-link cell), but the destructuring keeps only the first three matches, so this seems like the most straightforward solution. A more robust solution might be to pick out the specific <td> cells you want.