
Extract links for a specific domain from the HTML of a website


Below is my code to extract links from a given URL. My issue is that when I view the source of that URL, there is a link with the domain https://fs1.pdisk.pro:183, but it does not appear in the links I extract.

<?php
function extractLinks($url) {

  // Get the HTML content of the page.
  $html = file_get_contents($url);

  // Create a DOMDocument object.
  $dom = new DOMDocument();
  @$dom->loadHTML($html);

  // Get all the <source> elements.
  $anchors = $dom->getElementsByTagName('source');

  // Create an array to store the links.
  $links = array();

  // Loop through the <source> elements.
  foreach ($anchors as $anchor) 
  {
    // Get the src attribute of the element.
    $href = $anchor->getAttribute('src');

    // Add the link to the links array.
    $links[] = $href;
  }

  // Return the links array as JSON.
  return json_encode($links);
}

// Get the URL of the website to extract links from.
$url = 'http://pdisk.investro1.com/how-to-buy-life-insurance-online-qfevac8cq8x4.html';

// Extract the links from the website.
$links = extractLinks($url);

// Print the links; extractLinks() already returns them JSON-encoded.
echo $links;

Can someone help me extract the link for the needed domain from the given URL and, if possible, redirect to that extracted link and give the response in JSON format, like url=link?


Solution

  • You are asking for code to scrape a website. Extracting content without the site owner's consent can be illegal.

    That said, the link with the :183 port is not under an <a> tag. It is under a <source> tag nested inside a <video> element.

    The code should therefore query the <source> elements rather than the anchors: use $anchors = $dom->getElementsByTagName('source'); instead of $anchors = $dom->getElementsByTagName('a');.

    Likewise, read the src attribute rather than href: use $href = $anchor->getAttribute('src'); instead of $href = $anchor->getAttribute('href');.

    Beware: web scraping requires the owner's permission to extract data from the source website.
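
    The fix above, combined with the domain filter and the url=link JSON response from the question, might look roughly like the sketch below. It is a minimal illustration rather than a definitive implementation: the helper name extractSourceLinks() and the inline sample markup (including the /d/sample/video.mp4 path) are assumptions made so the example runs without a network request. In practice the HTML would come from file_get_contents($url), as in the question.

```php
<?php
// Hypothetical helper (not part of the original code): returns the src of every
// <source> element in $html whose host equals $host. The port is ignored,
// because parse_url() reports it separately from the host.
function extractSourceLinks(string $html, string $host): array
{
    // Collect parser warnings (e.g. about HTML5 tags) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('source') as $source) {
        $src = $source->getAttribute('src');
        if ($src !== '' && parse_url($src, PHP_URL_HOST) === $host) {
            $links[] = $src;
        }
    }
    return $links;
}

// Sample markup, assumed for illustration only.
$html = '<html><body><a href="https://example.com/">x</a>'
      . '<video><source src="https://fs1.pdisk.pro:183/d/sample/video.mp4" type="video/mp4"></video>'
      . '</body></html>';

$links = extractSourceLinks($html, 'fs1.pdisk.pro');

// Respond in the url=link shape asked for in the question (null if nothing matched).
echo json_encode(['url' => $links[0] ?? null], JSON_UNESCAPED_SLASHES);
// prints {"url":"https://fs1.pdisk.pro:183/d/sample/video.mp4"}
```

    To redirect instead of printing JSON, calling header('Location: ' . $links[0]); before any output is sent would be the usual approach, after checking that $links is not empty.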