Tags: c#, web-scraping, filter, html-agility-pack, tweets

How to identify whether a tweet is an original or a retweet when scraping with HtmlAgilityPack?


I want to collect a user's tweets for data analysis. I used the HtmlAgilityPack package to scrape Twitter, which gives me the top 30 tweets.

I located the tweet-text element and fetched all the tweets, but I also want to identify whether each one is an original tweet or a retweet. How can I do that?

I have analysed the HTML. A retweet contains an element with the classes tweet-context with-icn. But when I select on that class, I get a null reference exception, because not every tweet has that element. What should I select on, and how, to tell whether a tweet is a retweet?

Code:

HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://twitter.com/BarackObama");

// SelectNodes returns null when nothing matches, so check before iterating.
var tweetNodes = doc.DocumentNode.SelectNodes("//tr[@class='tweet-container']");

if (tweetNodes != null)
{
    foreach (var item in tweetNodes)
    {
        // Print the text content of each tweet container.
        Console.WriteLine(item.InnerText);
    }
}

In the code above, I fetch the tweets from Barack Obama's profile and get the top 30. How can I tell which ones are retweets?
Thank you.


Solution

  • Scraping Twitter 101

    1. Get all tweets from the page (each one comes in a handy table, <table class='tweet '>):

      HtmlWeb p = new HtmlWeb();
      var doc = p.Load(@"https://twitter.com/dailygametips");
      var nodes = doc.DocumentNode.SelectNodes("//table[@class='tweet  ']");
      
    2. Look in each node for a <span class='context'> to determine whether the tweet is a retweet:

      List<Tweet> tweets = new List<Tweet>();
      foreach (var node in nodes)
      {
          bool isRetweet = false;
          var spanNode = node.SelectSingleNode(".//span[@class='context']");
          if (spanNode != null && spanNode.InnerHtml.Contains("retweeted"))
          {
              isRetweet = true;
          }
      
    3. We also want the message text, so scrape the <div class='tweet-text'> next:

          string msg = string.Empty;
          var msgNode = node.SelectSingleNode(".//div[@class='tweet-text']");
          if (msgNode != null)
          {
              msg = msgNode.InnerText.Trim();
          }
          tweets.Add(new Tweet(msg, isRetweet));
      }
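
One caveat with the exact-match predicates above: `@class='tweet-text'` only matches when the attribute is exactly that string, and `SelectNodes` returns null rather than an empty collection when nothing matches. The following is a minimal sketch of a more tolerant variant. The inline HTML is a made-up sample loosely modelled on the markup described here, not real Twitter output, and `TweetScraper` and `HasClass` are hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TweetScraper
{
    // Hypothetical sample markup for illustration; real pages will differ.
    public const string SampleHtml = @"
<html><body>
<table class='tweet  '>
  <tr><td><span class='context'>Someone retweeted</span></td></tr>
  <tr><td><div class='tweet-text'>Hello from a retweet</div></td></tr>
</table>
<table class='tweet  '>
  <tr><td><div class='tweet-text'>An original tweet</div></td></tr>
</table>
</body></html>";

    // Builds an XPath 1.0 predicate that matches a single class token,
    // regardless of extra whitespace or additional classes on the element.
    static string HasClass(string token) =>
        "contains(concat(' ', normalize-space(@class), ' '), ' " + token + " ')";

    public static List<(string Message, bool IsRetweet)> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new List<(string Message, bool IsRetweet)>();

        // SelectNodes returns null when nothing matches, so guard before looping.
        var nodes = doc.DocumentNode.SelectNodes("//table[" + HasClass("tweet") + "]");
        if (nodes == null) return result;

        foreach (var node in nodes)
        {
            // The leading dot keeps each search inside the current tweet.
            var context = node.SelectSingleNode(".//span[" + HasClass("context") + "]");
            bool isRetweet = context != null && context.InnerText.Contains("retweeted");

            var text = node.SelectSingleNode(".//div[" + HasClass("tweet-text") + "]");
            string msg = text != null ? text.InnerText.Trim() : string.Empty;

            result.Add((msg, isRetweet));
        }
        return result;
    }
}
```

Calling `TweetScraper.Parse(TweetScraper.SampleHtml)` on this sample yields two entries, the first flagged as a retweet and the second not. The same token predicate would also match an element carrying several classes, such as the tweet-context with-icn element mentioned in the question.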
      

    Additionally, the Tweet container class:

    class Tweet
    {
        public Tweet(string message, bool isRetweet)
        {
            Message = message;
            IsRetweet = isRetweet;
        }
    
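        // The properties must be public: a private setter on an implicitly
        // private property does not compile, and callers need to read them.
        public string Message { get; private set; }
        public bool IsRetweet { get; private set; }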
    }
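
Once the tweets are collected, separating originals from retweets is a simple filter. A minimal sketch, assuming the message and flag are readable from outside the class. The `TweetInfo` and `TweetFilter` types are hypothetical stand-ins defined here so the example is self-contained, and are not part of the answer's code.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the container class, with public read access.
public class TweetInfo
{
    public TweetInfo(string message, bool isRetweet)
    {
        Message = message;
        IsRetweet = isRetweet;
    }

    public string Message { get; private set; }
    public bool IsRetweet { get; private set; }
}

public static class TweetFilter
{
    // Returns only the tweets that are not retweets, in their original order.
    public static List<TweetInfo> Originals(IEnumerable<TweetInfo> tweets) =>
        tweets.Where(t => !t.IsRetweet).ToList();
}
```

For data analysis that should cover only a user's own posts, passing the scraped list through `TweetFilter.Originals` drops the retweets before any further processing.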
    

    As you can tell, this is not really rocket science, but you do need to understand the basic principles of XPath and scraping.
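
One XPath detail the code above relies on is the leading dot in `.//span`. Without it, `//span` is evaluated from the document root even when it is called on an individual node, so every tweet would report the first match in the whole page. A small sketch on made-up markup (the HTML and the `XPathScopeDemo` name are illustrative assumptions):

```csharp
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Hypothetical two-element fragment used only to show scoping.
    public const string SampleHtml =
        "<div id='a'><span>first</span></div><div id='b'><span>second</span></div>";

    // Runs the given XPath with the second div as the context node
    // and returns the text of the span it finds.
    public static string SpanTextFromSecondDiv(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        var second = doc.DocumentNode.SelectSingleNode("//div[@id='b']");
        return second.SelectSingleNode(xpath).InnerText;
    }
}
```

Here `SpanTextFromSecondDiv(".//span")` returns "second", because the search stays inside the context node, while `SpanTextFromSecondDiv("//span")` returns "first", because the search restarts at the root of the document.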