BeautifulSoup4 performance on a Raspberry Pi 3


I am making a Kodi addon that I will run on my Raspberry Pi 3. In the addon I scrape information from a website so I can fill a list of items. Everything I have right now works, but when I deploy it on my Raspberry Pi 3, performance becomes an issue: it takes 15 seconds before the web page is parsed.

    soup = BeautifulSoup(response, "html.parser", parse_only=tiles)  # <-- this line

I already use SoupStrainer to improve performance, but it did not have the impact I was hoping for.

    _VRT_BASE = "https://www.vrt.be/"

    def __list_videos_az(self):
        joined_url = urljoin(self._VRTNU_BASE_URL, "./a-z/")
        response = urlopen(joined_url)
        tiles = SoupStrainer('a', {"class": "tile"})
        soup = BeautifulSoup(response, "html.parser", parse_only=tiles)
        listing = []
        for tile in soup.find_all(class_="tile"):
            link_to_video = tile["href"]
            li = self.__get_item(tile, "false")
            url = '{0}?action=getepisodes&video={1}'.format(_url, link_to_video)
            listing.append((url, li, True))

        xbmcplugin.addDirectoryItems(_handle, listing, len(listing))
        xbmcplugin.addSortMethod(_handle, xbmcplugin.SORT_METHOD_LABEL_IGNORE_THE)
        xbmcplugin.endOfDirectory(_handle)

    def __get_item(self, element, is_playable):
        thumbnail = self.__format_image_url(element)
        found_element = element.find(class_="tile__title")
        li = None
        if found_element is not None:
            li = xbmcgui.ListItem(found_element.contents[0]
                                  .replace("\n", "").strip())
            li.setProperty('IsPlayable', is_playable)
            li.setArt({'thumb': thumbnail})
        return li

Could someone tell me how to improve the performance of the program? I was thinking a regex might be faster, but a lot of people say you should not parse HTML that way, and putting the regex together is also challenging.

So is there anything I can do to improve performance?


Solution

  • I'd recommend trying the lxml parser, which is written in C (Cython, actually) and is generally faster than html.parser. To obtain the package, install it on Raspbian (apt-get install python-lxml or pip install lxml) and then copy it into your addon. The lxml package contains compiled binary modules, so it's important to get a build for your specific platform.
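
As a rough sketch of the suggestion above, the only change to the parsing call is the parser name passed to BeautifulSoup. The helper names `pick_parser` and `extract_tile_links` are made up for illustration and are not part of the addon; the fallback to "html.parser" is an assumption about how you might want the addon to behave on a platform where no lxml build is available.

```python
def pick_parser():
    """Return "lxml" if it can be imported, otherwise the pure-Python parser."""
    try:
        import lxml  # noqa: F401  (only checking that the binary module loads)
        return "lxml"
    except ImportError:
        return "html.parser"


def extract_tile_links(html):
    """Return the href of every <a class="tile"> element in the given HTML."""
    # Imported here so the parser check above works even without bs4 installed.
    from bs4 import BeautifulSoup, SoupStrainer

    # Same SoupStrainer as in the question: only matching tags are kept.
    tiles = SoupStrainer("a", {"class": "tile"})
    soup = BeautifulSoup(html, pick_parser(), parse_only=tiles)
    return [tile["href"] for tile in soup.find_all("a", class_="tile")]
```

In the question's code this would amount to replacing "html.parser" with "lxml" (or with the result of a check like `pick_parser()`) in the `BeautifulSoup(...)` call. Whether it removes most of the 15 seconds on a Raspberry Pi 3 is something you would need to measure on the device.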