javascript, screen-scraping, cheerio

Extract text with cheerio


I'm trying to write a script to extract the email ID and name from this website. I tried the following snippet, but it doesn't work.

<!DOCTYPE html>
<html>

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <title>foo</title>
    <meta name="description" content="">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="stylesheet" href="">
    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
</head>

<body>
    <div>
        <strong style="color: darkgreen;">Can read this tag</strong>

        <object id="external_page" type="text/html" data="https://aleenarais.com/buddy/" width="800px" height="600px"
            style="overflow:auto;border:5px ridge blue">
            <!-- I want to read tag values from this object -->
        </object>
    </div>

    <script>
        window.addEventListener('load', function () {
            const item = [];
            $('strong[style="color: darkgreen;"]').each(function () {
                item.push($(this).text())
            })
            console.log(item)

        })
       
    </script>
</body>

</html>

Is there any better way to do this? Or is it possible to convert the whole page into a string and extract the email using RegEx?


Solution

  • The email and name on the webpage are rendered inside an iframe whose source is an external site. To extract that information, you need to use a headless browser.

    I would suggest using Node.js and Puppeteer (https://www.npmjs.com/package/puppeteer).

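    const puppeteer = require("puppeteer");

    (async () => {
      const url = "https://aleenarais.com/buddy/";
      const browser = await puppeteer.launch();
      try {
        const page = await browser.newPage();
        // Wait until there have been no network connections for at least 500 ms.
        await page.goto(url, { waitUntil: "networkidle0" });

        // Find the iframe whose URL points at the external feed.
        const myframe = page
          .frames()
          .find((f) => f.url().includes("https://feedium.app/fetchh.php"));
        if (!myframe) {
          throw new Error("Could not find the feedium.app iframe");
        }

        // Collect the text of every <strong> tag inside that frame.
        const textFeed = await myframe.$$eval("strong", (sElements) =>
          sElements.map((el) => el.textContent)
        );
        // Skip the first entry (the count); the rest holds the name and email.
        console.log(textFeed.slice(1));
      } finally {
        // Close the browser even if one of the steps above fails.
        await browser.close();
      }
    })();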

    Puppeteer loads the page much as a user's browser would. It waits until all network calls are done (see the networkidle0 option) and then looks for the iframe whose URL contains fetchh.php. Inside that frame, the name and email are in strong tags, and the only other strong tag holds a count. So we extract the text of every strong tag, drop the first entry (the count), and are left with just the name and email.

    Output: [ 'JJ', 'j*[email protected]' ] // The values are masked here; the program prints the actual ones
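
On the question's idea of using a regular expression: once the texts of the strong tags are in an array, a pattern can pick out the email without relying on its position. This is a minimal sketch, assuming a hypothetical helper named splitNameAndEmail and made-up sample values (neither is part of the original script). The pattern is deliberately loose and is not a full email validator.

```javascript
// Loose pattern: something@something.something with no spaces.
// This is an illustrative assumption, not a complete email validator.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Hypothetical helper: separate email-like strings from everything else.
function splitNameAndEmail(texts) {
  // Trim whitespace and drop empty entries before classifying.
  const cleaned = texts.map((t) => t.trim()).filter((t) => t.length > 0);
  return {
    emails: cleaned.filter((t) => EMAIL_PATTERN.test(t)),
    others: cleaned.filter((t) => !EMAIL_PATTERN.test(t)),
  };
}

// Made-up sample data standing in for the strong-tag texts:
// a count first, then a name and an email.
const sample = ["1", " Jane Doe ", "jane@example.com"];

// Skip the count, as the script does, then classify the rest.
console.log(splitNameAndEmail(sample.slice(1)));
```

In the Puppeteer script, the array passed in would be the texts extracted from the iframe rather than the sample array.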

    Steps to run the script:

    1. Install Node.js (https://nodejs.org/en/download/).
    2. Install Puppeteer with npm i puppeteer.
    3. Copy the script into a file named demo.js.
    4. In the terminal, navigate to the directory that contains demo.js and run node demo.js.

    You should see the output.