
robots.txt and disallowing an absolute URL


I am using Heroku Pipelines. When I push my application, it is deployed to the staging app

https://appname.herokuapp.com/

and if everything is correct, I promote that app to production. There is no new build process; it is the same app that was built the first time for staging.

https://appname.com/

The problem is that this causes duplicate content: the two sites are exact clones of each other. I would like to exclude the staging app from indexing by Google and other search engines.

One way that I thought of was a robots.txt file.

For this to work, I would write it like this:

User-agent: *
Disallow: https://appname.herokuapp.com/

using the absolute URL, because this file will be on the server in both the staging and production applications, and I only want to remove the staging app from Google's index without touching the production one.

Is this the right way to do it?


Solution

  • No, the Disallow field can’t take full URL references. Your robots.txt would block URLs like these:

    • https://example.com/https://appname.herokuapp.com/
    • https://example.com/https://appname.herokuapp.com/foo

    The Disallow value always represents the beginning of the URL’s path.
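The prefix rule can be illustrated with a small sketch. This is a deliberately simplified model, not a real robots.txt parser: the function name `is_disallowed` is hypothetical, and wildcards, `Allow` lines, and per-crawler normalisation are ignored. It only shows that a full URL used as a Disallow value never matches the path of a URL on the staging host, while `/` matches every path.

```python
from urllib.parse import urlsplit


def is_disallowed(url: str, disallow_values: list[str]) -> bool:
    """Simplified model: a URL is blocked if its path starts with a Disallow value."""
    # Only the path takes part in the comparison; scheme and host are dropped.
    path = urlsplit(url).path or "/"
    # An empty Disallow value blocks nothing, so it is skipped.
    return any(bool(value) and path.startswith(value) for value in disallow_values)


# The rule from the question: the path "/foo" does not start with "https://...".
print(is_disallowed("https://appname.herokuapp.com/foo",
                    ["https://appname.herokuapp.com/"]))  # False

# A path-based rule: every path starts with "/".
print(is_disallowed("https://appname.herokuapp.com/foo", ["/"]))  # True
```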

    To block all URLs under https://appname.herokuapp.com/, you would need:

    Disallow: /
    

    So you have to use different robots.txt files for https://appname.herokuapp.com/ and https://appname.com/.
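Because staging and production run the same build, one way to serve different robots.txt files is to choose the body at request time from the `Host` header. The following is a minimal sketch under that assumption, written as a plain WSGI app; the names `robots_txt_for_host` and `app` are hypothetical, and how this would be wired into the actual application depends on the framework in use.

```python
STAGING_HOST = "appname.herokuapp.com"  # staging hostname from the question

BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # block every path
ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: block nothing


def robots_txt_for_host(host: str) -> str:
    """Return the robots.txt body for the given Host header value."""
    # Drop an optional ":port" suffix and normalise case before comparing.
    hostname = host.split(":", 1)[0].strip().lower()
    return BLOCK_ALL if hostname == STAGING_HOST else ALLOW_ALL


def app(environ, start_response):
    """Tiny WSGI app (assumed setup) that answers only /robots.txt."""
    if environ.get("PATH_INFO") == "/robots.txt":
        body = robots_txt_for_host(environ.get("HTTP_HOST", "")).encode("utf-8")
        status = "200 OK"
    else:
        body = b"Not Found"
        status = "404 Not Found"
    start_response(status, [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

An environment variable set differently on each app would work as the switch too; the `Host` header is used here only because it needs no per-app configuration.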

    If you don’t mind bots crawling https://appname.herokuapp.com/, you could use noindex instead (as a robots meta element or an X-Robots-Tag response header). But this would also require different behaviour for the two sites. An alternative that doesn’t require different behaviour is the canonical link type, which tells crawlers which URL is preferred for indexing.

    <!-- on https://appname.herokuapp.com/foobar -->
    <link rel="canonical" href="https://appname.com/foobar" />
    
    <!-- on https://appname.com/foobar -->
    <link rel="canonical" href="https://appname.com/foobar" />
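Both options can be sketched in a few lines. This is an illustration under assumptions, not a drop-in implementation: `canonical_link_tag` and `extra_headers_for_host` are hypothetical helper names, and the production origin and staging hostname are taken from the question. The canonical helper gives the same result on both hosts, while the noindex helper is the variant that has to behave differently per host.

```python
from html import escape

PRODUCTION_ORIGIN = "https://appname.com"   # production origin from the question
STAGING_HOST = "appname.herokuapp.com"      # staging hostname from the question


def canonical_link_tag(path: str) -> str:
    """Build the canonical link element for a request path; identical on both hosts."""
    if not path.startswith("/"):
        path = "/" + path
    # Escape the URL so it is safe inside an HTML attribute value.
    href = escape(PRODUCTION_ORIGIN + path, quote=True)
    return f'<link rel="canonical" href="{href}" />'


def extra_headers_for_host(host: str) -> list[tuple[str, str]]:
    """Return response headers asking crawlers not to index the staging host."""
    hostname = host.split(":", 1)[0].strip().lower()
    if hostname == STAGING_HOST:
        return [("X-Robots-Tag", "noindex")]
    return []


print(canonical_link_tag("/foobar"))
# <link rel="canonical" href="https://appname.com/foobar" />
```

Note that a crawler can only see a noindex signal on pages it is allowed to fetch, so this variant should not be combined with a robots.txt that blocks the staging host.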