Tags: php, short-url

How to identify the actual domain behind a short URL like goo.gl in PHP?


I have a forum, and I have a MySQL table which stores spam domains, so nobody can post a URL whose domain is listed in that table.

Some users spam using the https://goo.gl/ service. But I can't block the goo.gl domain itself, because that would affect other users too.

Is there a way to find the actual domain using PHP when users use short URL services like https://goo.gl/?


Solution

  • I can think of two ways to do this:

    1) This first one is specific to goo.gl, but other services may have similar interfaces: use the Google URL Shortener API. You can make a request, passing any goo.gl URL, and receive JSON back that includes the original URL, which you can then parse to extract the domain name and check it against your blacklist.

    See https://developers.google.com/url-shortener/ for an overview, and https://developers.google.com/url-shortener/v1/url/get for the specific method.

    2) This is cruder, but should work for pretty much any shortener service: simply request the URL (e.g. using cURL). Since a shortener is essentially a redirection service, you should get back an HTTP redirect response (a 301 or 302), and the response headers will include a Location header showing the real URL. Again, you can extract this, parse out the domain name and check it against your blacklist. This method certainly works for goo.gl URLs: I've checked, and they definitely return a redirect with that header. I would be surprised if other services did it any differently, as this is the standard HTTP convention for telling a client that a URL redirects elsewhere (301 for a permanent redirect, 302 for a temporary one).
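
    A minimal sketch of the second approach, assuming PHP 7.1+ with the cURL extension. The function names (extractLocation, resolveShortUrl, domainOf) and the 5-second timeout are illustrative choices, not part of any library. It sends a HEAD request, does not follow the redirect, and reads the Location header from the raw response:

```php
<?php
/**
 * Pull the Location header value out of a raw HTTP header block.
 * Returns null when no Location header is present.
 */
function extractLocation(string $rawHeaders): ?string
{
    // Header names are case-insensitive; the "m" flag lets ^ match at each line start.
    if (preg_match('/^Location:\s*(\S+)/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

/**
 * Request the short URL without following the redirect and return
 * the target it points at, or null if the response is not a redirect.
 */
function resolveShortUrl(string $shortUrl, int $timeoutSeconds = 5): ?string
{
    $ch = curl_init($shortUrl);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,   // HEAD request: headers only
        CURLOPT_HEADER         => true,   // include headers in the returned string
        CURLOPT_RETURNTRANSFER => true,   // return the response instead of printing it
        CURLOPT_FOLLOWLOCATION => false,  // we want the redirect itself, not its target
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $response = curl_exec($ch);
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Anything outside 3xx is not a redirect (or the request failed).
    if ($response === false || $status < 300 || $status >= 400) {
        return null;
    }
    return extractLocation($response);
}

/** Lower-cased host part of a URL, or null if the URL has none. */
function domainOf(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : null;
}
```

    The host returned by domainOf(resolveShortUrl($url)) is what you would compare against the spam-domain table. Some shorteners may chain redirects, in which case you would have to repeat the lookup (with a sensible hop limit) on the returned URL.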

    Of course, either method adds some overhead to your processing, so keep an eye on performance. You'll probably want to maintain a list of well-known URL shortening services, so you can first check whether a URL actually needs to be resolved to its original. Otherwise you'll end up making an HTTP request for every single URL submitted by users, which is unnecessary and will slow things down, especially if those legitimate URLs are content-heavy and/or take a long time to respond (whereas an API call, or a request to a URL that just returns a simple redirect with no content, should be fairly quick to reply).
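
    The shortener pre-check could be sketched roughly as follows. Everything named here is an assumption for illustration: the KNOWN_SHORTENERS list is a hypothetical starting point, $spamDomains stands in for the rows of the MySQL spam-domain table, and $resolve is whatever function you use to turn a short URL into its target (an API call or a redirect lookup):

```php
<?php
// Hypothetical list; extend it with whichever shorteners show up on your forum.
const KNOWN_SHORTENERS = ['goo.gl', 'bit.ly', 'tinyurl.com', 't.co'];

/** Lower-cased host part of a URL, or null if the URL has none. */
function hostOf(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : null;
}

/**
 * Decide whether a submitted URL points at a blacklisted domain.
 *
 * $spamDomains: lower-cased domains, e.g. loaded from the spam table.
 * $resolve:     callable(string $shortUrl): ?string returning the real URL.
 */
function isSpamUrl(string $url, array $spamDomains, callable $resolve): bool
{
    $host = hostOf($url);
    if ($host === null) {
        return false; // no host to check
    }

    // Only pay for a network round trip when the host is a known shortener.
    if (in_array($host, KNOWN_SHORTENERS, true)) {
        $target     = $resolve($url);
        $targetHost = $target !== null ? hostOf($target) : null;
        if ($targetHost !== null) {
            $host = $targetHost;
        }
    }

    return in_array($host, $spamDomains, true);
}
```

    Passing the resolver in as a callable keeps the check independent of how the lookup is done, and means ordinary URLs never trigger a request at all. This sketch matches hosts exactly, so subdomains such as www.spam.example would need their own entries or a suffix comparison.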