I am using Goose to read the title and text body of an article from a URL. However, this does not work with a Twitter URL, presumably because of the different HTML tag structure. Is there a way to read the tweet text from such a link?
One such example of a tweet (shortened link) is as follows:
NOTE: I know how to read tweets through the Twitter API, but I am not interested in that. I just want to get the text by parsing the HTML source, without all the Twitter authentication hassle.
Scrape it yourself
Open the URL of the tweet, pass the response to an HTML parser of your choice, and extract the elements you are interested in with XPath expressions.
Scraping is discussed in: http://docs.python-guide.org/en/latest/scenarios/scrape/
An XPath can be obtained in the browser by right-clicking the element you want, selecting "Inspect", then right-clicking the highlighted line in the Inspector and selecting "Copy" > "Copy XPath". This works if the structure of the site is always the same. Otherwise, write an expression based on properties (such as class names) that identify exactly the object you want.
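To illustrate why a copied XPath only holds while the page structure stays the same, here is a small sketch. The two HTML snippets are made up for the illustration (they are not real Twitter markup), and the names ORIGINAL, CHANGED, ABSOLUTE and BY_CLASS are hypothetical. The positional path stops matching as soon as an extra element is inserted, while the class-based expression still finds the paragraph.

```python
from lxml import html

# Hypothetical markup, invented for this illustration only.
ORIGINAL = "<html><body><div><p class='tweet-text'>hi</p></div></body></html>"
# Same page after an extra <div> was inserted before the tweet.
CHANGED = ("<html><body><div class='banner'>ad</div>"
           "<div><p class='tweet-text'>hi</p></div></body></html>")

# Positional path, similar in spirit to what "Copy XPath" can produce.
ABSOLUTE = "/html/body/div[1]/p/text()"
# Property-based path that does not depend on element positions.
BY_CLASS = "//p[contains(@class, 'tweet-text')]/text()"

results = {}
for name, doc in (("original", ORIGINAL), ("changed", CHANGED)):
    tree = html.fromstring(doc)
    results[name] = (tree.xpath(ABSOLUTE), tree.xpath(BY_CLASS))
    print(name, results[name])
```

On the changed page the positional path returns an empty list, because the first div is now the banner.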
In your case:
//div[contains(@class, 'permalink-tweet-container')]//strong[contains(@class, 'fullname')]/text()
will get you the name of the author and
//div[contains(@class, 'permalink-tweet-container')]//p[contains(@class, 'tweet-text')]//text()
will get you the content of the Tweet.
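Both expressions can be tried offline against a static snippet before hitting the real page. The markup below is a heavily simplified, hypothetical stand-in that only reuses the class names from the expressions above; the real page is much larger and may differ.

```python
from lxml import html

# Hypothetical, simplified markup; only the class names are taken from
# the XPath expressions above.
SAMPLE = """
<html><body>
<div class="permalink-tweet-container">
  <strong class="fullname show-popup-with-id">Example Author</strong>
  <p class="TweetTextSize tweet-text">Hello <a href="#">world</a>!</p>
</div>
<div class="replies">
  <p class="tweet-text">A reply that should be ignored</p>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
container = "//div[contains(@class, 'permalink-tweet-container')]"

# /text() returns only the direct text of <strong>.
author = tree.xpath(container + "//strong[contains(@class, 'fullname')]/text()")
# //text() returns every text node below <p>, including text inside links.
fragments = tree.xpath(container + "//p[contains(@class, 'tweet-text')]//text()")

print(author)     # ['Example Author']
print(fragments)  # ['Hello ', 'world', '!']
```

Restricting the search to the permalink container is what keeps the reply paragraph out of the result, and `//text()` is why the tweet comes back as several fragments rather than one string.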
The full working example:
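from lxml import html
import requests

# Fetch the tweet's permalink page
page = requests.get('https://twitter.com/UniteAlbertans/status/899468829151043584')
# Parse the raw HTML into an element tree
tree = html.fromstring(page.content)
# Select every text node inside the main tweet's text paragraph
print(tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()'))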
results in:
['Breaking:\n10 sailors missing, 5 injured after USS John S. McCain collides with merchant vessel near Singapore...\n\n', 'https://www.', 'washingtonpost.com/world/another-', 'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1', u'\xa0', u'\u2026', 'pic.twitter.com/UiGEZq7Eq6']
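Since the result is a list of text fragments, you will probably want to merge them into one string. A minimal sketch, assuming Python 3 strings; `join_fragments` is a hypothetical helper name, not part of lxml:

```python
def join_fragments(fragments):
    """Concatenate XPath text fragments and collapse runs of whitespace."""
    text = ''.join(fragments)
    # str.split() with no arguments splits on any whitespace, including
    # newlines and the non-breaking space (\xa0).
    return ' '.join(text.split())


# The fragments from the result above, written as Python 3 literals.
fragments = [
    'Breaking:\n10 sailors missing, 5 injured after USS John S. McCain '
    'collides with merchant vessel near Singapore...\n\n',
    'https://www.',
    'washingtonpost.com/world/another-',
    'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway'
    '/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html'
    '?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1',
    '\xa0',
    '\u2026',
    'pic.twitter.com/UiGEZq7Eq6',
]

text = join_fragments(fragments)
print(text)
```

The fragments are joined without a separator because the link is split across several nodes; joining with spaces would break the URL.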