I am trying to scrape information from a website with the following HTML format:
<tr class="odd">
<td class=""> <table class="inline-table">
<tbody><tr>
<td rowspan="2">
<img src="https://img.a.transfermarkt.technology/portrait/medium/881116-1664480529.jpg?lm=1" data-src="https://img.a.transfermarkt.technology/portrait/medium/881116-1664480529.jpg?lm=1" title="Darío Osorio" alt="Darío Osorio" class="bilderrahmen-fixed lazy entered loaded" data-ll-status="loaded"> </td>
<td class="hauptlink">
<a title="Darío Osorio" href="/dario-osorio/profil/spieler/881116">Darío Osorio</a> </td>
</tr>
<tr>
<td>Right Winger</td>
</tr>
</tbody></table>
</td><td class="zentriert">18</td><td class="zentriert"><img src="https://tmssl.akamaized.net/images/flagge/verysmall/33.png?lm=1520611569" title="Chile" alt="Chile" class="flaggenrahmen"></td><td class=""><table class="inline-table">
<tbody><tr>
<td rowspan="2">
<a title="Club Universidad de Chile" href="/club-universidad-de-chile/startseite/verein/1037"><img src="https://tmssl.akamaized.net/images/wappen/tiny/1037.png?lm=1420190110" title="Club Universidad de Chile" alt="Club Universidad de Chile" class="tiny_wappen"></a> </td>
<td class="hauptlink">
<a title="Club Universidad de Chile" href="/club-universidad-de-chile/startseite/verein/1037">U. de Chile</a> </td>
</tr>
<tr>
<td>
<img src="https://tmssl.akamaized.net/images/flagge/tiny/33.png?lm=1520611569" title="Chile" alt="Chile" class="flaggenrahmen"> <a title="Primera División" href="/primera-division-de-chile/transfers/wettbewerb/CLPD">Primera División</a> </td>
</tr>
</tbody></table>
</td><td class=""><table class="inline-table">
<tbody><tr>
<td rowspan="2">
<a title="Newcastle United" href="/newcastle-united/startseite/verein/762"><img src="https://tmssl.akamaized.net/images/wappen/tiny/762.png?lm=1472921161" title="Newcastle United" alt="Newcastle United" class="tiny_wappen"></a> </td>
<td class="hauptlink">
<a title="Newcastle United" href="/newcastle-united/startseite/verein/762">Newcastle</a> </td>
</tr>
<tr>
<td>
<img src="https://tmssl.akamaized.net/images/flagge/verysmall/189.png?lm=1520611569" title="England" alt="England" class="flaggenrahmen"> <a title="Premier League" href="/premier-league/transfers/wettbewerb/GB1">Premier League</a> </td>
</tr>
</tbody></table>
</td><td class="rechts">-</td><td class="rechts">€3.00m</td><td class="rechts hauptlink">? </td><td class="zentriert hauptlink"><a title="Darío Osorio to Newcastle United?" id="27730/Newcastle United sent scouts to Chile to follow Dario Osorio. the 18-year-old is being monitored by Barcelona, Wolverhampton and Newcastle United./http://www.90min.com//16127/180/Darío Osorio to Newcastle United?" class="icons_sprite icon-pinnwand-sprechblase sprechblase-wechselwahrscheinlichkeit" href="https://www.transfermarkt.co.uk/dario-osorio-to-newcastle-united-/thread/forum/180/thread_id/16127/post_id/27730#27730"> </a></td></tr>
I want to scrape "Darío Osorio", "U. de Chile" and "Newcastle", each of which is the text of a different element with [class="hauptlink"] in the HTML.
I have tried a couple of different things. My most recent attempt looks like this:
$('.odd', html).each((index, el) => {
  const source = $(el)
  const information = source.find('td.main-link').first().text().trim()
  const differentInformation = source.find('a:nth-child(1)').text()
})
But I am only successful in scraping "Darío Osorio", using the first() method. With my code, the variable "differentInformation" currently looks like this: "Darío OsorioU. de ChileNewcastle". The result I want in the end is a JSON object like this:
[
  {
    "firstInfo": "Darío Osorio",
    "secondInfo": "U. de Chile",
    "thirdInfo": "Newcastle"
  },
  {
    "firstInfo": "Information",
    "secondInfo": "Different Information",
    "thirdInfo": "More Different Information"
  }
]
After clarification in the comments, it sounds like you're looking for something like this:
const cheerio = require("cheerio"); // 1.0.0-rc.12

const url = "YOUR URL";

(async () => {
  const response = await fetch(url);

  if (!response.ok) {
    throw Error(response.statusText);
  }

  const html = await response.text();
  const $ = cheerio.load(html);

  // Select the odd and even rows inside .items
  const data = [...$(".items .odd, .items .even")].map(e => {
    // Destructure the text of the first three .hauptlink elements in the row
    const [player, currentClub, interestedClub] =
      [...$(e).find(".hauptlink")].map(e => $(e).text().trim());
    return {player, currentClub, interestedClub};
  });
  console.log(data);
})()
  .catch(error => console.log(error));
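This relies on .hauptlink, which exists in the first 3 row cells that you're interested in retrieving, so that seems like the most straightforward solution. Later cells in the row also carry .hauptlink (the "?" cell, for example), but the destructuring keeps only the first three matches. Perhaps a more robust solution would be to pick out the specific <td> cells you want.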