Tags: html, url, screen-scraping, html-content-extraction, layout-extraction

Extracting HTML elements in a given region?


Given a URL and a region defined by a rectangle, is there any way to determine which elements lie within that rectangle on the rendered page?

EDIT: Screen resolution, font size, etc. can all be set to reasonable defaults.


Solution

    • Get the document from the URL.
    • Render it (in a browser).
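    • For each element in the browser's DOM:
      • Get the rectangle[s] occupied by the element (an inline element that wraps across lines occupies more than one).
      • Test whether the element's rectangle[s] overlap, or lie entirely inside, the rectangle you're interested in.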
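
The steps above can be sketched roughly as follows. This is only an illustration under some assumptions, not a definitive implementation: the names Rect, elements_in_region and collect_layout are made up for this example, Selenium with a Chrome driver is just one possible way to render the page, and the choice between "overlaps the region" and "lies fully inside the region" is left as a flag because the question does not say which is wanted. The geometry part uses only the standard library; the browser part reads each element's boxes with the DOM method getClientRects() and shifts them by the scroll offset so they are in page coordinates.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page coordinates (hypothetical helper)."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height

    def intersects(self, other: "Rect") -> bool:
        # True when the two rectangles share some area (touching edges do not count).
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)

    def contains(self, other: "Rect") -> bool:
        # True when `other` lies entirely inside this rectangle.
        return (self.left <= other.left and self.top <= other.top
                and other.right <= self.right and other.bottom <= self.bottom)


def elements_in_region(elements: Iterable[Tuple[str, List[Rect]]],
                       region: Rect,
                       fully_inside: bool = False) -> List[str]:
    """Return the names of elements whose boxes overlap (or lie inside) `region`."""
    hits = []
    for name, rects in elements:
        # Skip boxes with no area (e.g. elements that are not displayed).
        boxes = [r for r in rects if r.width > 0 and r.height > 0]
        if not boxes:
            continue
        if fully_inside:
            match = all(region.contains(r) for r in boxes)
        else:
            match = any(region.intersects(r) for r in boxes)
        if match:
            hits.append(name)
    return hits


# JavaScript run inside the page: one entry per element, boxes in page coordinates.
_COLLECT_JS = """
return Array.from(document.querySelectorAll('*')).map(el => ({
  tag: el.tagName,
  rects: Array.from(el.getClientRects()).map(r =>
    [r.left + window.scrollX, r.top + window.scrollY, r.width, r.height])
}));
"""


def collect_layout(url: str, width: int = 1280, height: int = 800):
    """Render `url` and return [(tag, [Rect, ...]), ...]. Assumes Selenium + Chrome."""
    from selenium import webdriver  # third-party; imported lazily on purpose

    driver = webdriver.Chrome()
    try:
        # Sets the outer window size; the viewport is slightly smaller.
        driver.set_window_size(width, height)
        driver.get(url)
        raw = driver.execute_script(_COLLECT_JS)
    finally:
        driver.quit()
    return [(item["tag"], [Rect(*r) for r in item["rects"]]) for item in raw]


# Small hand-made layout so the comparison step can be tried without a browser.
layout = [
    ("HEADER", [Rect(0, 0, 800, 100)]),
    ("P", [Rect(50, 150, 300, 40)]),
    ("SPAN", [Rect(50, 150, 300, 20), Rect(50, 170, 120, 20)]),  # wraps over two lines
    ("IMG", [Rect(350, 200, 200, 100)]),
    ("FOOTER", [Rect(0, 900, 800, 50)]),
]
region = Rect(0, 120, 400, 100)

print(elements_in_region(layout, region))                     # ['P', 'SPAN', 'IMG']
print(elements_in_region(layout, region, fully_inside=True))  # ['P', 'SPAN']
```

With a real page, something like elements_in_region(collect_layout(url), region) would replace the hand-made layout. Note that ancestors such as HTML and BODY will usually overlap any region, so you may want the fully_inside mode, or to keep only the innermost matches.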