javascript html cheerio

Parsing a mix of plaintext and <b>, <br> tags into an array of strings


The wiki pages that I am trying to parse include the following HTML:

<div
  class="pi-smart-data-value pi-data-value pi-font pi-item-spacing pi-border-color"
  style="width: calc(1 / 1 * 100%)"
  data-source="unique_ability"
>
  <b>Captain</b><br /><b>"Back off!"</b><br />Push target on hit.
</div>

I would like to parse the content of the div into an array like this: ["Captain", "Back off!", "Push target on hit."]

If I use the text() method from cheerio (const uniqueAbilities = $('[data-source="unique_ability"]').text();), I get one long string with no separators: Captain"Back off!"Push target on hit. If I use the html() method (const uniqueAbilities = $('[data-source="unique_ability"]').html();), I get the HTML content of the node as a string, but I am then unable to parse that string into the pieces I need.

How would you parse this HTML into the desired output?

Thanks for the help.


Solution

  • Here is one way to obtain your desired output, though it might not cover every other case:

    // get the inner HTML of the element as a string
    let data = $('[data-source="unique_ability"]').html();
    console.log(data);
    // remove all line breaks (\n and \r)
    data = data.replace(/[\n\r]+/g, '');
    console.log(data);
    // split the string at every HTML tag
    data = data.split(/<[^>]+>/);
    console.log(data);
    // remove empty and whitespace-only strings from the array
    data = data.filter(el => el.trim() !== "");
    console.log(data);
    <!-- HTML that the script above runs against -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
    <div
      class="pi-smart-data-value pi-data-value pi-font pi-item-spacing pi-border-color"
      style="width: calc(1 / 1 * 100%)"
      data-source="unique_ability"
    >
      <b>Captain</b><br /><b>"Back off!"</b><br />Push target on hit.
    </div>
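
    The steps above can also be wrapped in a small function that takes the string returned by html(). This is only a sketch: htmlToTextParts is a hypothetical name, not part of cheerio or jQuery, and the sample string is the markup from the question. It also trims each piece, which the snippet above does not do. Note that the quotes around "Back off!" are kept, and that depending on the parser the quotes may come back as entities (for example &quot;), which this sketch does not decode.

```javascript
// Hypothetical helper (not part of cheerio or jQuery): splits an HTML
// fragment into its non-empty text pieces.
function htmlToTextParts(html) {
  return html
    .replace(/[\n\r]+/g, '')        // drop line breaks left over from the markup
    .split(/<[^>]+>/)               // cut the string at every tag
    .map(part => part.trim())       // strip indentation around each piece
    .filter(part => part !== '');   // discard the empty gaps between adjacent tags
}

// Sample input: the inner HTML of the div from the question.
const sample = `
  <b>Captain</b><br /><b>"Back off!"</b><br />Push target on hit.
`;

console.log(htmlToTextParts(sample));
// [ 'Captain', '"Back off!"', 'Push target on hit.' ]
```

    With cheerio, the call would presumably look like htmlToTextParts($('[data-source="unique_ability"]').html()). Because the split is done with a regular expression rather than a parser, text that itself contains "<" or ">" characters, or nested markup whose structure matters, is not handled.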