
Find all span tags in a string using JavaScript


I have a piece of text similar to the following; it is essentially a string of HTML.

hello
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
<div>....</div>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
<div>....</div>
<div>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
</div>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>

What I would like is to capture the inner text of every span tag (in the example above, that would be Professional Referee) and store the results in an array.

I am thinking a regex would be the way to go. The one I have so far looks like this:

^/(<span)([\a-zA-Z0-9\s]*)(<\/span>)/$

I am not great with regex, and an additional issue is that each span tag may have attributes that differ from those of the other tags.

I think if I can get the full span tags into an array, I can manage to strip out the leftover markup.

I got a regex101 link here: https://regex101.com/r/9K90pa/1

Can someone point me in the right direction?


Solution

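  • Regex is not the ideal tool for parsing HTML, because it cannot reliably handle nesting or varying attributes. In the browser, the DOM API offers DOMParser, which turns the string into a document you can query: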

    const html = `hello
    <span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
    <div>....</div>
    <span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
    <div>....</div>
    <div>
    <span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
    </div>
    <span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
    <span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>`;
    
    const doc = new DOMParser().parseFromString(html, "text/html");
    const spanTexts = Array.from(doc.querySelectorAll("span"), span => span.textContent);
    
    console.log(spanTexts);
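
DOMParser is a browser API, so the snippet above will not run as-is in an environment without a DOM (plain Node.js, for example). If the markup is as simple as in the question, a regex can serve as a rough fallback. The following is only a sketch under two assumptions that are not guaranteed by the question: spans are never nested inside other spans, and no attribute value contains a ">" character. The helper name extractSpanTexts and the sample string are made up for illustration.

```javascript
// Hypothetical helper: collects the text between each <span ...> and </span>.
// Assumptions: spans are not nested, and attribute values contain no ">".
function extractSpanTexts(html) {
  // <span\b   the opening tag name (the \b avoids matching e.g. <spanner>)
  // [^>]*     any attributes, up to the end of the opening tag
  // ([\s\S]*?) the content, captured lazily so it stops at the first </span>
  const spanPattern = /<span\b[^>]*>([\s\S]*?)<\/span>/gi;
  return Array.from(html.matchAll(spanPattern), match => match[1].trim());
}

const sample = `hello
<span dir="auto" class="a">Professional Referee</span>
<div>....</div>
<div>
<span id="x">Second</span>
</div>`;

// Logs an array containing "Professional Referee" and "Second".
console.log(extractSpanTexts(sample));
```

Unlike textContent, this returns the raw markup between the tags, so HTML entities are not decoded and any child tags are left in place. For anything beyond trivially flat markup, a real parser remains the safer choice.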