I am currently scraping news article sites. While extracting their main content, I ran into the issue that many of them have embedded tweets, like these:
I use XPath expressions with XPath Helper (a Chrome extension) to test whether I can get the content, and then add the expression to my Scrapy (Python) spider. However, elements inside a #shadow-root seem to be outside the scope of the regular DOM. I am looking for a way to get the content inside these kinds of elements, preferably with XPath.
One way to scrape pages containing shadow DOM with tools that don't support the shadow DOM API is to recursively iterate over the elements that host a shadow root and replace them with their HTML code:
// Returns the HTML of a given shadow root.
const getShadowDomHtml = (shadowRoot) => {
  let shadowHTML = '';
  for (let el of shadowRoot.childNodes) {
    // Text nodes have a nodeValue; element nodes have outerHTML.
    shadowHTML += el.nodeValue || el.outerHTML;
  }
  return shadowHTML;
};

// Recursively replaces shadow DOMs with their HTML.
const replaceShadowDomsWithHtml = (rootElement) => {
  for (let el of rootElement.querySelectorAll('*')) {
    if (el.shadowRoot) {
      // Flatten any nested shadow roots first, then append
      // the shadow content to the host element's light DOM.
      replaceShadowDomsWithHtml(el.shadowRoot);
      el.innerHTML += getShadowDomHtml(el.shadowRoot);
    }
  }
};

replaceShadowDomsWithHtml(document.body);
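Once the shadow roots have been flattened this way, the embedded tweet markup is part of the regular DOM, so ordinary XPath queries can reach it. As a rough sketch (the expression below targets Twitter's usual blockquote embed and is an illustrative guess, not taken from any specific page):

// Hypothetical example: query the flattened DOM with XPath.
const result = document.evaluate(
  '//blockquote[contains(@class, "twitter-tweet")]',
  document,
  null,
  XPathResult.FIRST_ORDERED_NODE_TYPE,
  null
);
if (result.singleNodeValue) {
  console.log(result.singleNodeValue.textContent);
}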
If you are scraping with a full browser (Chrome with Puppeteer, PhantomJS, etc.), just inject this script into the page. It is important to execute it only after the whole page has rendered, because the replacement can break the JS code of the shadow DOM components.
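For example, with Puppeteer you can pass the two functions above to page.evaluate() once loading has settled. This is a minimal sketch: the URL is a placeholder, and waiting for networkidle2 is just one reasonable way to let the page finish rendering first:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Placeholder URL; waiting for network idle gives shadow DOM
  // components time to render before we flatten them.
  await page.goto('https://example.com/article', { waitUntil: 'networkidle2' });

  await page.evaluate(() => {
    const getShadowDomHtml = (shadowRoot) => {
      let shadowHTML = '';
      for (let el of shadowRoot.childNodes) {
        shadowHTML += el.nodeValue || el.outerHTML;
      }
      return shadowHTML;
    };
    const replaceShadowDomsWithHtml = (rootElement) => {
      for (let el of rootElement.querySelectorAll('*')) {
        if (el.shadowRoot) {
          replaceShadowDomsWithHtml(el.shadowRoot);
          el.innerHTML += getShadowDomHtml(el.shadowRoot);
        }
      }
    };
    replaceShadowDomsWithHtml(document.body);
  });

  // The returned HTML now includes the former shadow DOM content and
  // can be fed to Scrapy selectors or any other XPath-capable parser.
  const html = await page.content();
  console.log(html);
  await browser.close();
})();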
Check out the full article I wrote on this topic: https://kb.apify.com/tips-and-tricks/how-to-scrape-pages-with-shadow-dom