Tags: javascript, dom, web-scraping, puppeteer, innerhtml

Traversing a complex DOM and scraping values


Consider the following DOM structure.

 <div class="bodyCells">
       <div style="foo">
           <div style="foo">
                <div style="foo"> 
                  <div style="foo1"> '1-contains the list of text elements I want to scrape'</div>
                  <div style="foo2"> '2-contains the list of text elements I want to scrape'</div>
                </div>
                <div style="foo"> 
                  <div style="foo3"> '3-contains the list of text elements I want to scrape'</div>
                  <div style="foo4"> '4-contains the list of text elements I want to scrape'</div>
                </div>
           </div>
       </div>
</div>     

Starting from the class name bodyCells, I need to scrape the text from each of the innermost divs one at a time (first from the 1st div, then from the next one, and so on) and store each result in a separate array. How can I achieve this with Puppeteer?

NOTE: I have tried selecting by the class name directly, but that returns all the text in a single array. I need the text from each div separately, stored in different arrays.

Expected Output:

  array1 = ['text present within the style="foo1" div tag']
  array2 = ['text present within the style="foo2" div tag']
  array3 = ['text present within the style="foo3" div tag']
  array4 = ['text present within the style="foo4" div tag']

This is what I've done so far:

const value = await page1.evaluate(() => {
  // Each .bodyCells element contributes ONE string here, because
  // textContent concatenates the text of all of its descendants.
  const textItems = document.getElementsByClassName("bodyCells");
  const extractedItems = [];
  for (let i = 0; i < textItems.length; i++) {
    extractedItems.push(textItems[i].textContent);
  }
  return extractedItems;
});

Solution

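  • Not sure if this is exactly what you need, but you can select the innermost divs with a child-combinator selector and wrap the text of each one in its own array: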

    const html = `
      <!doctype html>
      <html>
        <head><meta charset="UTF-8"><title>Test</title></head>
        <body>
          <div class="bodyCells">
            <div style="foo">
              <div style="foo">
                <div style="foo">
                  <div style="foo1"> '1-contains the list of text elements I want to scrape'</div>
                  <div style="foo2"> '2-contains the list of text elements I want to scrape'</div>
                </div>
                <div style="foo">
                  <div style="foo3"> '3-contains the list of text elements I want to scrape'</div>
                  <div style="foo4"> '4-contains the list of text elements I want to scrape'</div>
                </div>
              </div>
            </div>
          </div>
        </body>
      </html>`;
    
    const puppeteer = require('puppeteer');
    
    (async function main() {
      try {
        const browser = await puppeteer.launch();
        const [page] = await browser.pages();
    
        // Encode the markup so characters such as # or % cannot break the data URL.
        await page.goto(`data:text/html,${encodeURIComponent(html)}`);
    
        const data = await page.evaluate(() => Array.from(
          document.querySelectorAll('div.bodyCells > div > div > div > div'),
          div => [div.innerText],
        ));
    
        console.log(data);
    
        await browser.close();
      } catch (err) {
        console.error(err);
      }
    })();
    

    Output:

    [
      [ "'1-contains the list of text elements I want to scrape'" ],
      [ "'2-contains the list of text elements I want to scrape'" ],
      [ "'3-contains the list of text elements I want to scrape'" ],
      [ "'4-contains the list of text elements I want to scrape'" ]
    ]
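
    The selector `div.bodyCells > div > div > div > div` depends on the exact nesting depth. If the real page nests its cells differently, one alternative is to collect every leaf element (an element with no element children) under the container. The following is only a sketch under that assumption: `collectLeafTexts` is a hypothetical helper, not part of Puppeteer or of the answer above, and the mock objects merely stand in for the question's markup so the logic can be tried without a browser. Text that sits directly inside a non-leaf element is ignored by this approach.

```javascript
// Hypothetical helper (not from the original answer): walks an element tree
// and returns one single-item array per leaf element, in document order.
// It relies only on `children` and `textContent`, so it accepts real DOM
// elements as well as plain objects shaped like them.
function collectLeafTexts(root) {
  const results = [];
  (function walk(node) {
    const kids = Array.from(node.children || []);
    if (kids.length === 0) {
      // Leaf element: keep its trimmed text, skipping empty leaves.
      const text = (node.textContent || '').trim();
      if (text) results.push([text]);
      return;
    }
    kids.forEach(walk);
  })(root);
  return results;
}

// Mock tree with the same shape as the question's markup
// (bodyCells > div > div > div > div); the texts are placeholders.
const leaf = text => ({ children: [], textContent: text });
const branch = (...children) => ({ children, textContent: '' });
const bodyCells = branch(branch(branch(
  branch(leaf(' 1-first '), leaf(' 2-second ')),
  branch(leaf(' 3-third '), leaf(' 4-fourth ')),
)));

// Destructure the nested result into the separate arrays the question asks for.
const [array1, array2, array3, array4] = collectLeafTexts(bodyCells);
console.log(array1, array2, array3, array4);
```

    Functions passed to `page.evaluate` are serialized, so a helper defined in Node is not visible inside the page. One way to use the sketch with Puppeteer is to define the helper inside the callback, or to pass its source as a string expression, for example ``await page.evaluate(`(${collectLeafTexts})(document.querySelector('.bodyCells'))`)``.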