Tags: javascript, php, html, strip-tags

I want to scrape data within a local web directory


All the pages are linked to one another through href attributes. The very first page is named mainpage.html. I want to remove the <img> tags from all the pages and show only the elements within <div id="pB">.

Instead of removing the image tags manually, one page at a time, I'd like a generic method for this. If anything is unclear, please ask. Thanks in advance.

The structure of the tree is:

<html> -> <body> -> <div id="pB">

Solution

  • As the structure and aim of your project are not entirely clear to me, I will try to give you some hints for the various aspects I can identify. I am assuming a solution in PHP.

    Find all the pages linked from your mainpage.html: Regexp for extracting all links and anchor texts from HTML

    Or, even more elegantly:

    Regexp for extracting all links and anchor texts from HTML
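
    As a rough illustration of the link-collection step, here is a minimal sketch that uses PHP's built-in DOMDocument instead of a regular expression (a swapped-in technique, not the one from the linked posts). The function name extractLinks and the sample markup are hypothetical; in practice the HTML string would come from something like file_get_contents('mainpage.html').

    ```php
    <?php
    // Hypothetical helper: maps each href to its anchor text for every <a href="..."> in $html.
    function extractLinks(string $html): array
    {
        $doc = new DOMDocument();
        // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $links = [];
        foreach ($doc->getElementsByTagName('a') as $anchor) {
            $href = $anchor->getAttribute('href');
            if ($href !== '') {
                $links[$href] = trim($anchor->textContent);
            }
        }
        return $links;
    }

    // Sample markup standing in for the contents of mainpage.html (an assumption).
    $html = '<html><body><div id="pB">'
          . '<a href="page2.html">Page 2</a> <a href="page3.html">Page 3</a>'
          . '</div></body></html>';

    $links = extractLinks($html);
    print_r($links);
    ```

    Because the result is keyed by href, a page that is linked several times appears only once, which is convenient when the list is later used to decide which files to open.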

    Alternatively, since you mentioned a "local web directory", you could also get all the files via glob():

    https://www.php.net/manual/en/function.glob.php
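
    A minimal sketch of the glob() approach, assuming all pages sit in a single directory and end in .html. The temporary directory and file names below are made up so the example runs on its own; point $dir at your real web directory instead.

    ```php
    <?php
    // Assumption: a throwaway directory stands in for the local web directory.
    $dir = sys_get_temp_dir() . '/pages_' . uniqid();
    mkdir($dir);
    file_put_contents("$dir/mainpage.html", '<html></html>');
    file_put_contents("$dir/page2.html", '<html></html>');
    file_put_contents("$dir/notes.txt", 'not a page');

    // glob() returns the matching paths (sorted alphabetically by default) or false on error.
    $files = glob($dir . '/*.html') ?: [];

    foreach ($files as $file) {
        echo basename($file), PHP_EOL;
    }

    // Clean up the throwaway directory again.
    array_map('unlink', glob($dir . '/*') ?: []);
    rmdir($dir);
    ```

    Note that glob() does not descend into subdirectories, so a nested site would need one call per directory or a recursive iterator.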

    Assuming you have the names of all the files you want to parse in an $array, you can iterate over that array, open each file, and apply the modified version of strip_tags described in this comment:

    http://www.php.net/manual/en/function.strip-tags.php#86964

    You can then either save the modified pages or display them in your div.
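
    The built-in strip_tags() only accepts a list of tags to keep, which is presumably why a modified version is suggested for removing one specific tag. As an alternative sketch (not the code from the linked comment), the same result can be obtained with DOMDocument and DOMXPath: locate <div id="pB">, drop every <img> inside it, and return what is left. The function name and the sample page are hypothetical.

    ```php
    <?php
    // Hypothetical helper: returns the inner HTML of <div id="..."> with all <img> elements
    // removed, or null when the page has no such div.
    // $divId is assumed to be a plain identifier without quotes.
    function extractContentWithoutImages(string $html, string $divId = 'pB'): ?string
    {
        $doc = new DOMDocument();
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($doc);
        $div = $xpath->query('//div[@id="' . $divId . '"]')->item(0);
        if ($div === null) {
            return null;
        }

        // Work on an array copy so that removing nodes cannot disturb the iteration.
        foreach (iterator_to_array($xpath->query('.//img', $div)) as $img) {
            $img->parentNode->removeChild($img);
        }

        // Serialize the children of the div, i.e. its inner HTML.
        $inner = '';
        foreach ($div->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        return $inner;
    }

    // Sample page standing in for one of the files (an assumption).
    $page = '<html><body><div id="pB"><p>Hello <img src="a.png"> world</p>'
          . '<img src="b.png"></div></body></html>';

    $result = extractContentWithoutImages($page);
    echo $result, PHP_EOL;
    ```

    To process the whole directory, the function could be called once per file with the string returned by file_get_contents().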

    Hope this helps a bit.