python, python-3.x, lxml, elementtree, python-requests-html

'\xad' appearing in list of strings in Python code


First, I am a beginner, just bordering on intermediate with Python, so please be patient with my approach to this problem. I was working on a web scraping mini project using lxml etree and requests (code is beneath this paragraph). I wanted to scrape a website about a current media spectacle and decided to go about this in an OOP way for practice (although I doubt this approach is fitting or implemented at all well; feedback on this would be greatly appreciated) and so that I could reuse the class to scrape other pages in the same script. That is when I noticed that when I tried to retrieve and print text from <p> and <span> elements in the get_stories() method, '\xad' would show up often and in strange places. I could not find anything specific to my situation on the internet, but I did find some material on encoding/decoding and Unicode, which I am not too familiar with. Maybe there is an encoding/decoding issue when the raw HTML is converted into an element's text attribute? But as I say, this is beyond me. Constructive feedback on my code and my problem would be very much appreciated. Thanks!

from lxml import etree
import requests

class Page:
    
    headers = {"User-Agent" : "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"}

    def __init__(self, url):
        try:
            self.html = requests.get(url, headers=Page.headers).text 
            self.tree = etree.HTML(self.html)
            self.articles = dict()
            self.var = None
        except:
            raise SystemExit("Invalid url")

    def get_stories(self):
        headers = [span.text for span in self.tree.xpath('//a[@class="u-clickable-card__link"]//span')]
        snippets = [span.text for span in self.tree.xpath('//div[@class="gc__excerpt"]//p')]
        print(headers)
       

url = "https://www.aljazeera.com/tag/julian-assange/"

page1 = Page(url)
page1.get_stories()

Here is the output:

['The Take: What will hap\xadpen to Ju\xadlian As\xadsange if he is ex\xadtra\xaddit\xaded?', 'The tri\xadals of Ju\xadlian As\xadsange: A death sen\xadtence for democ\xadra\xadcy', 'US lawyers urge UK court to block Ju\xadlian As\xadsange ex\xadtra\xaddi\xadtion ap\xadpeal bid', 'His\xadto\xadry Il\xadlus\xadtrat\xaded: Ju\xadlian As\xadsange’s last stand?', 'Wik\xadiLeaks founder Ju\xadlian As\xadsange makes fi\xadnal bid to avoid ex\xadtra\xaddi\xadtion to US', 'Why does the US want Ju\xadlian As\xadsange ex\xadtra\xaddit\xaded?', 'Who is Ju\xadlian As\xadsange? Will he be ex\xadtra\xaddit\xaded to the US?', '‘Enough is enough’: Aus\xadtralian PM de\xadnounces US, UK le\xadgal pur\xadsuit of As\xadsange', 'Aus\xadtralian law\xadmak\xaders press US en\xadvoy for Ju\xadlian As\xadsange re\xadlease', 'What does the fu\xadture hold for Ju\xadlian As\xadsange?', 'The Im\xadpris\xadon\xadment of Ju\xadlian As\xadsange', 'Protests in Chi\xadna: The blank sheets tell a tale', 'Top me\xaddia out\xadlets de\xadmand US end pros\xade\xadcu\xadtion of Ju\xadlian As\xadsange', 'In new book, a jour\xadnal\xadist makes the case for Ju\xadlian As\xadsange']


Solution

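  • Hex code 0xad, rendered as \xad in your text, is the Unicode code point for a soft hyphen (U+00AD).

    It lets formatters easily figure out where they can put a hyphen if a word needs to be split across lines. It is normally invisible; it shows up as \xad here because printing a list shows the repr() of each string, which escapes non-printable characters. For example, rendering the first string in your list on a particularly thin display device: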

    The Take: What will happen to Ju-
    lian Assange if he is extradited?

    If you want to get rid of them you can use something like:

    new_list = [item.replace("\xad", "") for item in old_list]
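
To check this without hitting the website, here is a minimal sketch using only the standard library. The helper name strip_soft_hyphens and the constant SOFT_HYPHEN are made up for illustration, and the sample string is copied from the output in the question. The None check is an assumption about how you might call it: lxml's .text can be None for an element with no leading text, so the helper passes None through rather than failing.

```python
import unicodedata

# "\u00ad" and "\xad" are two spellings of the same character.
SOFT_HYPHEN = "\u00ad"


def strip_soft_hyphens(text):
    """Return text with every soft hyphen removed (hypothetical helper).

    None is passed through unchanged, since an element's .text may be None.
    """
    if text is None:
        return None
    return text.replace(SOFT_HYPHEN, "")


# First headline from the output in the question.
sample = "The Take: What will hap\xadpen to Ju\xadlian As\xadsange if he is ex\xadtra\xaddit\xaded?"

# Ask the Unicode database what the character actually is.
char_name = unicodedata.name(SOFT_HYPHEN)
print(char_name)

# Printing a list shows repr() of each item, so the escapes are visible.
print([sample])

# After cleaning, the repr() no longer contains any \xad escapes.
cleaned = strip_soft_hyphens(sample)
print([cleaned])
```

In get_stories() the same helper could be applied inside the list comprehension, for example [strip_soft_hyphens(span.text) for span in ...], so the stored strings are already clean.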