c# url strip google-alerts

I need to strip a Google Alerts URL


To preface, I know there are similar threads about this, but I am using C#, not Java, Python, or PHP. Some threads provide a solution that only works for a single URL, which is not general enough. Thanks for not flagging me.

So I am using Google Alerts to get links to articles via email. I have already written one program that strips the URLs out of the email and another that scrapes the websites. My issue is that the links in the Google Alerts email look like this:

https://www.google.com/url?rct=j&sa=t&url=http://www.foxnews.com/health/2016/08/19/virtual-reality-treadmills-help-prevent-falls-in-elderly.html&ct=ga&cd=CAEYACoTOTc2NjE4NjYyNzMzNzc3NDcyODIaODk2NWUwYzRjMzdmOGI4Nzpjb206ZW46VVM&usg=AFQjCNGyK2EyVBLoKnNkdxIBDf8a_B3Ung

Yeah, ugly.

Because these links redirect to the actual article through Google, my scraping program does not work on them. I have tried a million different regexes from questions here and other sources. I managed to strip off everything up to the http:// of the actual article, but the result still has a tail of query parameters that screws it up. The links now look like this, and the code that produces them follows:

http://www.foxnews.com/health/2016/08/19/virtual-reality-treadmills-help-prevent-falls-in-elderly.html&ct=ga&cd=CAEYACoTOTc2NjE4NjYyNzMzNzc3NDcyODIaODk2NWUwYzRjMzdmOGI4Nzpjb206ZW46VVM&usg=AFQjCNGyK2EyVBLoKnNkdxIBDf8a_B3Ung

    private List<string> GetLinks(string message)
    {
        List<string> list = new List<string>();
        Regex urlRx = new Regex(@"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)", RegexOptions.IgnoreCase);

        MatchCollection matches = urlRx.Matches(message);
        foreach (Match match in matches)
        {
            if(!match.ToString().Contains("news.google.com/news") && !match.ToString().Contains("google.com/alerts"))
            {
                string find = "=http";
                int ind = match.ToString().IndexOf(find);                    
                list.Add(match.ToString().Substring(ind+1));
            }                
        }
        return list;
    }        

Some help getting rid of the endings would be awesome, be it a new RegEx or some extra code. Thanks in advance.


Solution

  • You can use HttpUtility.ParseQueryString to read the value of the url parameter from the query string. It is located in the System.Web namespace (on .NET Framework, a reference to the System.Web assembly is required).

    using System;
    using System.Web;
    var uri = new Uri("https://www.google.com/url?rct=j&sa=t&url=http://www.foxnews.com/health/2016/08/19/virtual-reality-treadmills-help-prevent-falls-in-elderly.html&ct=ga&cd=CAEYACoTOTc2NjE4NjYyNzMzNzc3NDcyODIaODk2NWUwYzRjMzdmOGI4Nzpjb206ZW46VVM&usg=AFQjCNGyK2EyVBLoKnNkdxIBDf8a_B3Ung");
    var queries = HttpUtility.ParseQueryString(uri.Query); // splits the query string into name/value pairs
    var foxNews = queries["url"]; //http://www.foxnews.com/health/2016/08/19/virtual-reality-treadmills-help-prevent-falls-in-elderly.html
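
To make this work for every link rather than one hard-coded URL, the same call can be wrapped in a small helper. The following is only a sketch: the class name AlertLinkCleaner and the method name Unwrap are made up for illustration, and it assumes that redirect links always have the host www.google.com and the path /url, which is based solely on the single example in the question.

```csharp
using System;
using System.Web;

// Hypothetical helper; the names are illustrative and not part of any library.
public static class AlertLinkCleaner
{
    // Returns the article URL wrapped inside a Google redirect link,
    // or the input unchanged when it does not look like such a link.
    public static string Unwrap(string link)
    {
        Uri uri;
        if (!Uri.TryCreate(link, UriKind.Absolute, out uri))
            return link; // not an absolute URL, leave it alone

        // Assumption: redirect links are of the form https://www.google.com/url?...
        bool isRedirect =
            string.Equals(uri.Host, "www.google.com", StringComparison.OrdinalIgnoreCase)
            && uri.AbsolutePath == "/url";
        if (!isRedirect)
            return link;

        // ParseQueryString also URL-decodes the value, so a percent-encoded
        // target comes back in plain form.
        string target = HttpUtility.ParseQueryString(uri.Query)["url"];
        return string.IsNullOrEmpty(target) ? link : target;
    }
}
```

In the GetLinks method from the question, the IndexOf/Substring lines could then be replaced with list.Add(AlertLinkCleaner.Unwrap(match.Value)); so that the trailing &ct=...&cd=...&usg=... parameters are dropped by the query-string parser instead of by string slicing. Links that are not Google redirects pass through unchanged.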