Tags: python-2.7, url, web-scraping, text-extraction, tweets

Is it possible to read the text of a tweet from its URL without the Twitter API?


I am using Goose to read the title and body text of an article from a URL. However, this does not work with a Twitter URL, I guess because of the different HTML tag structure. Is there a way to read the tweet text from such a link?

One such example of a tweet (shortened link) is as follows:

https://twitter.com/UniteAlbertans/status/899468829151043584/photo/1

NOTE: I know how to read tweets through the Twitter API, but I am not interested in that. I just want to get the text by parsing the HTML source, without all the Twitter authentication hassle.


Solution

  • Scrape it yourself

    Open the URL of the tweet, pass the response to an HTML parser of your choice, and extract the nodes you are interested in with XPath expressions.

    Scraping is discussed in: http://docs.python-guide.org/en/latest/scenarios/scrape/

    To obtain an XPath, right-click the element you want in the browser, select "Inspect", then right-click the highlighted line in the Inspector and select "Copy" > "Copy XPath". A copied XPath only works if the structure of the site is always the same; otherwise, select by properties (such as class names) that identify exactly the element you want.

    In your case:

    //div[contains(@class, 'permalink-tweet-container')]//strong[contains(@class, 'fullname')]/text()
    

    will get you the name of the author and

    //div[contains(@class, 'permalink-tweet-container')]//p[contains(@class, 'tweet-text')]//text()
    

    will get you the content of the Tweet.

    The full working example:

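    from lxml import html
    import requests

    # Fetch the tweet's permalink page and parse the returned HTML.
    page = requests.get('https://twitter.com/UniteAlbertans/status/899468829151043584')
    tree = html.fromstring(page.content)

    # Collect every text node inside the tweet-text paragraph of the permalink tweet.
    tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')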
    

    results in:

    ['Breaking:\n10 sailors missing, 5 injured after USS John S. McCain collides with merchant vessel near Singapore...\n\n', 'https://www.', 'washingtonpost.com/world/another-', 'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1', u'\xa0', u'\u2026', 'pic.twitter.com/UiGEZq7Eq6']