Tags: python, web-scraping, plugins, beautifulsoup, kodi

Scraping links from a website using Python / Beautiful Soup for a Kodi addon


The website I'm trying to scrape media links from (for a Kodi addon) doesn't have much in the way of class etc. markers, but each link is in some sort of unique layout.

I have created the basic Kodi addon from another working one, but I'm having issues getting Python/BeautifulSoup to scrape the links. Other addons rely on class attributes and similar markers, but the website I'm trying to scrape doesn't use many of these.

I've tried all sorts of forums with no luck; most Kodi addon forums are old and not very active. The guides I've looked at seem to go from step 1 to step 1000 very quickly, and the examples they give aren't relevant. I've looked at 30 or so different addons thinking that would help, but I can't work it out.

The media links, episode titles, descriptions and images I'm trying to scrape are listed on www.thisiscriminal.com/episodes

The full addon I've done so far is at Github-repository

I can see in the page source that they're clearly set out (see code).

I basically just need to be able to parse the website, find the bits below for each episode, populate them as links on the Kodi addon page, and then list the next episode underneath. Any help would be greatly appreciated. I've spent about 3 straight days trying to do this and am both very glad and very annoyed that I dropped out of that IT degree I started in 2002.

WEBSITE CODE I NEED TO PULL

(episode image)
<img width="300" height="300" ...
https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png" ../>    

(episode title)
<h3><a href="https://thisiscriminal.com/episode-115-cecilia-5-24-19/">Cecilia</a></h3>

(episode number)
<h4>Episode #115</h4>

(episode link)
<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3"

(episode description)
</header>When Cecilia....</article>

CODE

import requests
import re
from bs4 import BeautifulSoup

def get_soup(url):
    """
    @param: url of site to be scraped
    """
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    print "type: ", type(soup)
    return soup

get_soup("https://thisiscriminal.com/episodes")

def get_playable_podcast(soup):
    """
    @param: parsed html page
    """
    subjects = []

    for content in soup.find_all('a'):

        try:
            link = content.find('<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/')
            link = link.get('href')
            print "\n\nLink: ", link

            title = content.find('<h4>Episode ')
            title = title.get_text()

            desc = content.find('div', {'class': 'summary'})
            desc = desc.get_text()


            thumbnail = content.find('img')
            thumbnail = thumbnail.get('src')
        except AttributeError:
            continue


        item = {
                'url': link,
                'title': title,
                'desc': desc,
                'thumbnail': thumbnail
        }

        #needto check that item is not null here
        subjects.append(item)

    return subjects

2019-06-09 00:05:35.719 T:1916360240 ERROR: Control 55 in window 10502 has been asked to focus, but it can't
2019-06-09 00:05:41.312 T:1165988576 ERROR: EXCEPTION Thrown (PythonToCppException) : -->Python callback/script returned the following error<-
 - NOTE: IGNORING THIS CAN LEAD TO MEMORY LEAKS!
Error Type:
Error Contents: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/osmc/.kodi/addons/plugin.audio.abcradionational/addon.py", line 44, in
    desc = soup.get_text().replace('\xa0', ' ').replace('\n', ' ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
-->End of Python script error report<--
2019-06-09 00:05:41.636 T:1130349280 ERROR: GetDirectory - Error getting plugin://plugin.audio.abcradionational/
2019-06-09 00:05:41.636 T:1916360240 ERROR: CGUIMediaWindow::GetDirectory(plugin://plugin.audio.abcradionational/) failed


Solution

  • The good news is that the page loads its content from a WordPress JSON endpoint (wp-json), so you can issue a simple request against that endpoint directly instead of scraping the HTML. The other answer covers how to find it.

    You can then parse whatever you need out of that JSON. The text description is returned as HTML inside the JSON, so you can pass it to bs4 and parse it as required. Example below. You can explore the JSON object for the Cecilia episode here, or paste the following into a JSON viewer:

    {'title': 'Cecilia', 'excerpt': {'short': 'When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another...', 'long': "When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don't.” Sponsors: Article Visit article.com/criminal to get $50 off your...", 'full': "When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don't.” Sponsors: Article Visit article.com/criminal to get $50 off your first purchase..."}, 'content': '<p data-pm-context="[]">When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don&#8217;t.”</p>\n<p data-pm-context="[]">Sponsors:</p>\n<p><strong>Article</strong> Visit <a href="http://article.com/criminal">article.com/criminal </a>to get $50 off your first purchase of $100 or more.</p>\n<p><a href="https://www.therealreal.com/"><strong>The Real Real</strong></a> Shop in-store, online, or download the app, and get 20% off select items with the promo code REAL.</p>\n<p><strong>Simplisafe</strong> Protect your home today and get free shipping at <a href="http://SimpliSafe.com/CRIMINAL">SimpliSafe.com/CRIMINAL</a></p>\n<p><strong>Squarespace</strong> Try <a href="http://Squarespace.com/criminal">Squarespace.com/criminal </a>for a free trial and when you’re ready to launch, use the offer code INVISIBLE to save 10% off your first purchase of a website or domain.</p>\n<p><strong>Sun Basket</strong> Go to <a href="http://sunbasket.com/criminal">sunbasket.com/criminal </a>to get up to $80 off today!</p>\n', 
'image': {'thumb': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-150x150.png', 'medium': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-300x300.png', 'large': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-1024x1024.png', 'full': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png'}, 'episodeNumber': '115', 'audioSource': 'https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3', 'musicCredits':"FALSE", 'id': 3129, 'slug': 'episode-115-cecilia-5-24-19', 'date': '2019-05-24 19:43:44', 'permalink': 'https://thisiscriminal.com/episode-115-cecilia-5-24-19/', 'next':"None", 'prev': {'slug': 'episode-114-philip-and-becky', 'title': 'Episode 114: Philip and Becky (5.10.2019)'}}
    

    The request URL carries a query string, so you can alter the number of items returned per request. The response also lists the total number of pages, so you know how many requests are needed to return all content.

    If you look at the query string

    posts=1000&page=1

    you can see the two parameters, posts and page, which you can alter accordingly.
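As a rough sketch of walking through the pages with those two parameters: the endpoint URL is the one used in this answer, but `build_url`, `iter_posts`, `fetch_json` and `max_pages` are names made up for illustration. The name of the total-pages field is not shown here, so this version simply stops at the first page that returns no posts. That the endpoint returns an empty or missing 'posts' list past the last page is an assumption you should check.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2 (older Kodi versions)

BASE_URL = 'https://thisiscriminal.com/wp-json/criminal/v1/episodes'


def build_url(posts, page):
    """Build the request URL for one page of results."""
    return BASE_URL + '?' + urlencode([('posts', posts), ('page', page)])


def iter_posts(fetch_json, posts_per_page=100, max_pages=50):
    """Yield every post, requesting one page at a time.

    fetch_json is any callable that takes a URL and returns the decoded
    JSON as a dict, e.g. lambda url: requests.get(url).json().
    max_pages is a safety cap (hypothetical default) so a misbehaving
    endpoint cannot make the loop run forever.
    """
    for page in range(1, max_pages + 1):
        data = fetch_json(build_url(posts_per_page, page))
        posts = data.get('posts') or []
        if not posts:
            # Assumption: an empty page means there is nothing left to fetch.
            break
        for post in posts:
            yield post
```

Passing the fetch function in, rather than calling requests inside the loop, keeps the paging logic easy to try out with canned data before pointing it at the live site.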

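    import requests
    from bs4 import BeautifulSoup as bs

    # Fetch the episode list as JSON; posts and page control how many items come back
    r = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000&page=1').json()

    for post in r['posts']:
        title = post['title']
        # The description is HTML inside the JSON, so parse it (parser named explicitly)
        soup = bs(post['content'], 'html.parser')
        # First paragraph only. For the full text on Python 2, use unicode literals,
        # otherwise '\xa0' raises the UnicodeDecodeError shown in the log:
        # soup.get_text().replace(u'\xa0', u' ').replace(u'\n', u' ')
        desc = soup.select_one('p').text
        img = post['image']['full']
        episode_link = post['audioSource']  # direct .mp3 URL; sure this is what you wanted?
        episode_number = post['episodeNumber']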
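To get from there to the item dicts that `get_playable_podcast` in the question builds, something along these lines might work. It is only a sketch: `post_to_item` and `_FirstParagraph` are hypothetical names, the keys are the ones visible in the sample JSON above, and it assumes Python 3 (Kodi 19 and later). It also swaps bs4 for the standard library's `HTMLParser` so it runs with no extra modules; using bs4 as in the loop above is just as valid.

```python
from html.parser import HTMLParser


class _FirstParagraph(HTMLParser):
    """Collect the text inside the first <p> element only."""

    def __init__(self):
        super().__init__()
        self.in_p = False   # True while inside the first <p>
        self.done = False   # True once the first <p> has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and not self.done:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            self.in_p = False
            self.done = True

    def handle_data(self, data):
        if self.in_p:
            self.parts.append(data)


def post_to_item(post):
    """Map one post from the JSON response to the dict shape used in the question."""
    parser = _FirstParagraph()
    parser.feed(post.get('content') or '')
    parser.close()
    # Replace non-breaking spaces, the character behind the 0xa0 error in the log.
    desc = ''.join(parser.parts).replace('\xa0', ' ').strip()
    return {
        'url': post['audioSource'],
        'title': post['title'],
        'episode_number': post['episodeNumber'],
        'desc': desc,
        'thumbnail': post['image']['full'],
    }
```

Each returned dict can then be appended to `subjects` exactly as the question's code does, and the rest of the addon can stay as it is.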