javascript jquery html regex html-parsing

Better way of extracting text from HTML in JavaScript


I'm trying to scrape text from an HTML string by using container.innerText || container.textContent where container is the element from which I want to extract text.

Usually, the text I want to extract is located in <p> tags. So for the HTML below as an example:

<div id="container">
    <p>This is the first sentence.</p>
    <p>This is the second sentence.</p>
</div>

Using

var container = document.getElementById("container");
var text = container.innerText || container.textContent; // the text I want

returns "This is the first sentence.This is the second sentence.", with no space between the first period and the start of the second sentence.

My overall goal is to parse the text with Stanford CoreNLP, but its parser cannot detect that these are two sentences because they are not separated by a space. Is there a better way of extracting text from HTML such that the sentences are separated by a space character?

The HTML I'm parsing will have the text I want mostly in <p> tags, but it may also contain <img>, <a>, and other tags embedded between <p> tags.


Solution

  • jQuery has the text() method, which does what you want. Will this work for you?

    I'm not sure whether it handles everything in your container, but it works in my example. It also picks up the text of an <a> tag and appends it to the result.

    Update 20.12.2020

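    If you're not using jQuery, you could implement something similar in vanilla JS like this:

    // Select the direct children of the container, not the container itself
    const nodes = Array.from(document.querySelectorAll("#container > *"));
    const text = nodes
      .map((node) => node.textContent.trim())
      .filter((t) => t.length > 0) // skip elements without text, e.g. <img>
      .join(" ");
    

    querySelectorAll("#container > *") selects every child element of the container. The selector "#container" alone would match only the container itself, leaving nothing to join. Array.from converts the resulting NodeList into an array so that Array methods such as filter, map and join are available.

    Finally, the text is built by using map to read each element's textContent, filter to drop elements that have none, and join to put a space between the pieces. Note that text sitting directly inside the container, outside any child element, is not picked up by this selector.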
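
When inline tags such as <a> are nested inside <p> tags, it can help to walk the tree and insert a space only around block-level elements. The following is a minimal sketch of that idea, not the answer author's method: extractText and BLOCK_TAGS are hypothetical names, and the list of tags treated as block-level is an assumption you would adjust for your own HTML. It only relies on nodeType, tagName, nodeValue and childNodes, so it should work on real DOM elements in a browser as well as on simple mock objects.

```javascript
// Tags treated as block-level here. This list is an assumption; extend as needed.
const BLOCK_TAGS = new Set([
  "P", "DIV", "LI", "BR", "TR", "BLOCKQUOTE",
  "H1", "H2", "H3", "H4", "H5", "H6",
]);

// Hypothetical helper: walks an element tree and returns its text with a
// single space between the contents of block-level elements.
function extractText(root) {
  const parts = [];

  function walk(node) {
    if (node.nodeType === 3) {        // 3 = text node
      parts.push(node.nodeValue);
      return;
    }
    if (node.nodeType !== 1) return;  // 1 = element; skip comments etc.

    const isBlock = BLOCK_TAGS.has(node.tagName);
    if (isBlock) parts.push(" ");     // separator before the block
    for (const child of node.childNodes) walk(child);
    if (isBlock) parts.push(" ");     // separator after the block
  }

  walk(root);
  // Collapse runs of whitespace (including source indentation) to one space.
  return parts.join("").replace(/\s+/g, " ").trim();
}
```

In a browser this could be called as extractText(document.getElementById("container")). Inline elements such as <a> contribute their text without an extra separator, so a link in the middle of a sentence does not split it.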

    <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
    <script>
        // Run once the DOM is ready
        $(function() {
            var textToParse = $('#container').text();
            // .text() keeps the extracted string from being re-parsed as HTML
            $('#output').text(textToParse);
        });
    </script>
    <div id="container">
        <p>This is the first sentence.</p>
        <p>This is the second sentence.</p>
        <img src="http://placehold.it/200x200" alt="Nice picture">
        <p>Third sentence.</p>
    </div>
    
    <h2>output:</h2>
    <div id="output"></div>
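
The string returned by text() (or textContent) generally includes the line breaks and indentation that sit between tags in the markup, so the sentences may end up separated by newlines and several spaces rather than one space. If the downstream parser expects single spaces, a small normalization step can be applied. This is a sketch only: normalizeWhitespace is a hypothetical helper name, and the sample string is an assumed approximation of what the container above might yield.

```javascript
// Hypothetical helper: collapse every run of whitespace to one space
// and trim the ends.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Assumed sample resembling the raw text of the container above.
const raw =
  "\n    This is the first sentence.\n    This is the second sentence.\n    \n    Third sentence.\n";

console.log(normalizeWhitespace(raw));
// prints "This is the first sentence. This is the second sentence. Third sentence."
```

With jQuery this could be applied as normalizeWhitespace($('#container').text()) before handing the text to the parser.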