
Octoparse and relative XPath iframe extraction issues


I am trying to use Octoparse to extract the podcast details from Marie Brown's "Beyond the kitchen table" website: https://beyondthekitchentable.co.uk/podcast/

I'm using Octoparse's free version, which allows scraping locally. Octoparse auto-detects the Title, Title_URL, and Content data on the page and correctly sets up the Pagination, Scroll Page, and Loop Item workflow to extract those three fields. However, it does not auto-detect the 'Date' and 'Podcast time duration' fields of each individual podcast, apparently because these pieces are embedded from an iframe.

I am able to add Date and Podcast time duration as custom fields using an absolute XPath, i.e. //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1], but this results in the same value being copied into every record. When I try to fix this by using Octoparse's Relative XPath setting with //span[@class="cp-episode-date"], so that each loop item gets its own value, it does not return any values, even though that same XPath finds all the items when I search for it in Chrome DevTools. I saw what might be another helpful post on Stack Exchange about this, but I was not able to make sense of it.

The expression //span[@class="cp-episode-date"] works as a relative XPath in the sense that it finds multiple Date items in Chrome DevTools, but it is not complete, and I am not sure how to implement the per-item iframe traversal that Octoparse's Relative XPath setting needs for the custom Date and Podcast time duration fields. I even tried installing the SelectorsHub Chrome extension, but it didn't bring up the nested SelectorsHub view for querying the XPath the way the SelectorsHub YouTube video demonstrates; it only showed me the relative XPath I am already showing below.

Please have a look at this site using Octoparse and see if it is possible. If so, how can I do it?

When Absolute Path is used - //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1]

vs.

When Relative Path is used - //span[@class="cp-episode-date"]



Solution

  • The webpage contains many iframes, and I don't know whether Octoparse can handle them. It is easier to choose another starting point.

    For example, use Apple Podcasts:

    https://podcasts.apple.com/gb/podcast/the-website-coach/id1587503231

    The dates can be retrieved with the following XPath:

    //div[@class="l-row"]//time[@class]/@aria-label
    

    Another possibility is to scrape the following page:

    https://feeds.captivate.fm/the-website-coach/

    The dates can be retrieved with the following XPath:

    //h4/text()
    

    Even easier, get the data directly from this URL (a JSON file):

    https://itunes.apple.com/lookup?id=1587503231&media=podcast&entity=podcastEpisode&limit=100
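If the JSON route is acceptable, the date and duration of each episode could also be read outside Octoparse with a few lines of Python. The following is only a minimal sketch using the standard library. The field names `wrapperType`, `trackName`, `releaseDate`, and `trackTimeMillis` are assumptions based on what this kind of lookup response typically contains, so check them against the actual file before relying on them. The helper names (`format_duration`, `extract_episodes`, `fetch_episodes`) are made up for illustration.

```python
import json
from urllib.request import urlopen

# URL taken from the answer above.
LOOKUP_URL = (
    "https://itunes.apple.com/lookup"
    "?id=1587503231&media=podcast&entity=podcastEpisode&limit=100"
)


def format_duration(millis):
    """Turn a millisecond count into H:MM:SS; return None if it is missing."""
    if millis is None:
        return None
    seconds = int(millis) // 1000
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def extract_episodes(payload):
    """Keep only episode records and pull out title, date and duration.

    The key names used here are assumptions about the response layout.
    """
    episodes = []
    for item in payload.get("results", []):
        if item.get("wrapperType") != "podcastEpisode":
            continue  # skip the entry that describes the show itself
        episodes.append({
            "title": item.get("trackName"),
            # Keep only the YYYY-MM-DD part of an ISO 8601 timestamp.
            "date": (item.get("releaseDate") or "")[:10],
            "duration": format_duration(item.get("trackTimeMillis")),
        })
    return episodes


def fetch_episodes(url=LOOKUP_URL):
    """Download the JSON file and return the parsed episode list."""
    with urlopen(url, timeout=30) as response:
        return extract_episodes(json.load(response))


if __name__ == "__main__":
    for episode in fetch_episodes():
        print(episode["date"], episode["duration"], episode["title"])
```

Because the parsing is separated from the download, `extract_episodes` can be tried on a saved copy of the JSON file first, which makes it easy to adjust the key names if the real response differs.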