javascript, node.js, html-parsing

NodeJS - Parse HTML and find certain strings multiple times


I am using puppeteer to load a website and then store the HTML of that site using:

html = await page.evaluate('new XMLSerializer().serializeToString(document.doctype) + document.documentElement.outerHTML');

This works fine and returns the HTML as expected (long story short, I can't use plain HTTP requests on this site).

The HTML contains a chunk that looks like this:

<ul class="styled-radio">
<li>
<input type="radio" name="variant_id" id="variant_id_118018" value="118018">
<label for="variant_id_118018">5</label>
</li>
<li>
<input type="radio" name="variant_id" id="variant_id_118019" value="118019">
<label for="variant_id_118019">6</label>
</li>
<li>
<input type="radio" name="variant_id" id="variant_id_118020" value="118020">
<label for="variant_id_118020">6,5</label>
</li>
... keeps going ...
</ul>

For each variant_id_xxxxxx, I need to get the xxxxxx number and the label's inner text, then store them as xxxxxx:innerTextHere.

For example, the first one in the block above would be 118018:5.

It would also be great to collect all the xxxxxx:innerTextHere values in an array called sizes, so the final result for the HTML above would be ["118018:5", "118019:6", "118020:6,5"].

Thanks in advance :)


Solution

  • You can use the cheerio npm package to get this result. Please refer to the sample code.

    const cheerio = require('cheerio')
    
    const data = `
    <ul class="styled-radio">
    <li>
    <input type="radio" name="variant_id" id="variant_id_118018" value="118018">
    <label for="variant_id_118018">5</label>
    </li>
    <li>
    <input type="radio" name="variant_id" id="variant_id_118019" value="118019">
    <label for="variant_id_118019">6</label>
    </li>
    <li>
    <input type="radio" name="variant_id" id="variant_id_118020" value="118020">
    <label for="variant_id_118020">6,5</label>
    </li>
    ... keeps going ...
    </ul>`;
    
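    const result = [];
    
    // Parse the HTML string into a cheerio document.
    const $ = cheerio.load(data);
    
    // Select every radio input named "variant_id".
    const variants = $("input[name='variant_id']");
    
    variants.each((index, { attribs }) => {
        // `attribs` holds the element's raw attributes, e.g. id and value.
        const { id, value } = attribs;
        // Find the label whose `for` attribute points at this input.
        const label = $("label[for='" + id + "']");
        result.push({
            id,
            value,
            label: label.text()
        });
    });
    
    // Each entry looks like { id: 'variant_id_118018', value: '118018', label: '5' }.
    console.log(result);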
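
The question asks for an array of xxxxxx:innerTextHere strings named sizes, while the code above collects objects. A minimal sketch of that last step, assuming result has the shape built above; the array is hard-coded here purely for illustration so the snippet runs without cheerio:

```javascript
// Illustrative stand-in for the `result` array produced by the cheerio code above.
const result = [
    { id: 'variant_id_118018', value: '118018', label: '5' },
    { id: 'variant_id_118019', value: '118019', label: '6' },
    { id: 'variant_id_118020', value: '118020', label: '6,5' }
];

// Join each input's value and its label text with a colon.
// trim() guards against stray whitespace around the label text.
const sizes = result.map(({ value, label }) => `${value}:${label.trim()}`);

console.log(sizes);
// [ '118018:5', '118019:6', '118020:6,5' ]
```

Since a label such as 6,5 contains a comma itself, it is probably safer to keep sizes as an array of strings than to join everything into one comma-separated string.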