I need some advice because I am stuck scraping a web page with Apify. I am using apify/web-scraper, and basic scraping already works (name, description, price, etc.), but the page also lists product variants and I don't know the best way to scrape them.
The product variant form looks like this:
<form class="qty add-to-basket add-to-basket-multi" method="post" action="">
  <fieldset>
    <table>
      <tbody>
        <tr>
          <td>white</td>
          <td>
            <span>
              <a href="" class="watch fancybox-modal">x</a>
            </span>
          </td>
        </tr>
        <tr class="alt-row">
          <td>black</td>
          <td>
            <span class="tooltip" data-tip="6 pcs stock" data-tippos="top">
              <input type="text" name="form_product_add_to_basket[count_2945]" value="0" class="count" data-id="2945" placeholder="ks" autocomplete="off" />
            </span>
          </td>
        </tr>
        <tr>
          <td>green</td>
          <td>
            <span>
              <a href="" class="watch fancybox-modal">x</a>
            </span>
          </td>
        </tr>
      </tbody>
    </table>
  </fieldset>
</form>
As you can see, if a variant is not available its row has no <input> element (only the "x" link); otherwise it has one, wrapped in a span that carries the data-tip attribute.
In the output I would like to get something like this (shown as XML, since I later need to convert Apify's output to XML):
<variants>
  <variant>
    <name>white</name>
    <stock>0</stock>
  </variant>
  <variant>
    <name>black</name>
    <stock>6</stock>
  </variant>
  <variant>
    <name>green</name>
    <stock>0</stock>
  </variant>
</variants>
The stock of "6" for the black variant comes from data-tip. I think it should be possible to extract it with a regex.
My current code without variants:
async function pageFunction(context) {
  const {
    request,
    log
  } = context;
  var result = [];
  if (!$(".product-desc").length) {
    return null;
  } else {
    const {
      url
    } = request;
    const category = url.split("?category=")[1];
    const title = $('.price-desc h1').text();
    var description = '';
    if ($('#product-desc li').text().length > 0) {
      description = $('#product-desc li').text()
    } else {
      description = $('.desc p:last').text()
    }
    const price = $('.wvat span:eq(0)').text();
    return {
      category,
      title,
      description,
      price
    }
  }
}
You can use each tr as the item delimiter and then extract the name and stock from its td cells:
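const variants = [];
// Scope the selector to the variant form so rows from other tables are not picked up.
$('form.add-to-basket-multi tbody tr').each((_, el) => {
  const $el = $(el);
  variants.push({
    name: $el.find('td:eq(0)').text().trim(),
    // "6 pcs stock" parses to 6; rows without a data-tip (the "x" link) fall back to 0.
    stock: parseInt($el.find('[data-tip]').attr('data-tip'), 10) || 0,
  });
});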
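If you prefer the regex you mentioned, and want to see how the result maps onto your XML layout, here is a minimal sketch in plain JavaScript. It assumes the tooltip always contains the count as its first run of digits (as in "6 pcs stock"). The helper names parseStock, escapeXml and variantsToXml are made up for illustration and are not part of Apify or jQuery.

```javascript
// Hypothetical helper: pull the first number out of a tooltip such as "6 pcs stock".
// Returns 0 when the attribute is missing or contains no digits.
function parseStock(tip) {
  const match = /\d+/.exec(tip || '');
  return match ? parseInt(match[0], 10) : 0;
}

// Hypothetical helper: escape the characters that are special in XML text.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: serialize [{ name, stock }, ...] into the <variants> layout from the question.
function variantsToXml(variants) {
  const items = variants.map(({ name, stock }) =>
    `<variant><name>${escapeXml(name)}</name><stock>${stock}</stock></variant>`);
  return `<variants>${items.join('')}</variants>`;
}

// Sample data mirroring the three rows in the question (only "black" has a data-tip).
const sampleVariants = [
  { name: 'white', stock: parseStock(undefined) },
  { name: 'black', stock: parseStock('6 pcs stock') },
  { name: 'green', stock: parseStock(undefined) },
];
console.log(variantsToXml(sampleVariants));
```

Inside the pageFunction you would call parseStock($el.find('[data-tip]').attr('data-tip')) for the stock value and add variants to the returned object next to category, title, description and price.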