Tags: ruby-on-rails, amazon-s3, robots.txt, rails-activestorage

Prevent bots from accessing rails active_storage images


My site has a large number of graphs which are recalculated each day as new data is available. The graphs are stored on Amazon S3 using active_storage. A typical example would be

# app/models/graph.rb
class Graph < ApplicationRecord
  has_one_attached :plot
end

and in the view

<%= image_tag graphs.latest.plot %>

where graphs.latest retrieves the latest graph. Each day a new graph with an attached plot is created, and the old graph/plot is deleted.
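
For illustration, the daily regeneration is roughly equivalent to the following (a simplified sketch; the job name and plot file path are placeholders, not my actual code):

# app/jobs/regenerate_plot_job.rb (illustrative sketch)
class RegeneratePlotJob < ApplicationJob
  def perform
    new_graph = Graph.create!
    # Attach today's freshly rendered plot
    new_graph.plot.attach(
      io: File.open("/tmp/latest_plot.png"),
      filename: "plot-#{Date.current}.png",
      content_type: "image/png"
    )
    # Destroying the old graphs also purges their attachments, which is
    # what invalidates the URLs the bots have already indexed
    Graph.where.not(id: new_graph.id).destroy_all
  end
end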

A number of bots, including Google's and Yandex's, have been indexing the graphs, but they then generate exceptions when they return and request the image again at URLs like

www.myapp.com/rails/active_storage/representations/somelonghash

Is there a way to produce a durable link for the plot that does not expire when the graph/plot is deleted and recalculated? Failing that, is there a way to block bots from accessing these plots?

Note that I currently have a catchall at the end of the routes.rb file:

get '*all', to: 'application#route_not_found', constraints: lambda { |req|
      req.path.exclude? 'rails/active_storage'
    } if Rails.env.production?

The exclusion of active storage in the catchall is in response to this issue. It is tempting to remove the active_storage exemption, but that might then break legitimate active_storage routes.

Maybe I can put something in rack_rewrite.rb to fix this?


Solution

  • Interesting question.

    A simple solution would be to use send_data to serve the image directly. However, that has its own issues, mostly increased server bandwidth usage (and reduced server performance). Still, a solution like that is what you need if you don't want to go to the trouble, described below, of creating a redirect model and the logic around it.
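
    A minimal sketch of that approach, assuming a hypothetical PlotsController behind a stable route such as get "/plots/latest" (the controller, route and Graph.latest scope are assumptions, not something from your app):

    # config/routes.rb
    # get "/plots/latest", to: "plots#latest"

    # app/controllers/plots_controller.rb
    class PlotsController < ApplicationController
      # Streams the current plot bytes directly, so the public URL never
      # changes even though the underlying blob is replaced every day.
      def latest
        blob = Graph.latest.plot.blob
        send_data blob.download,
                  filename: blob.filename.to_s,
                  type: blob.content_type,
                  disposition: "inline"
      end
    end

    Bots can then be pointed at that stable URL, at the cost of the image bytes passing through your Rails process instead of being served from S3.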


    Original Answer

    The redirect will require setting up some sort of Redirects::Graph model that can verify a graph was deleted and redirect to the new graph instead of the requested one. It would have two fields: an old_signed_id (the big long hash) and a new_signed_id.
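
    Something along these lines for the migration and model (the table name and uniqueness constraint are just one way to organise it):

    # db/migrate/xxxxxxxx_create_redirects_graphs.rb
    class CreateRedirectsGraphs < ActiveRecord::Migration[6.0]
      def change
        create_table :redirects_graphs do |t|
          t.string :old_signed_id, null: false
          t.string :new_signed_id, null: false
          t.timestamps
        end
        # Lookups happen by old_signed_id; the new_signed_id index supports
        # the chain-flattening update described below.
        add_index :redirects_graphs, :old_signed_id, unique: true
        add_index :redirects_graphs, :new_signed_id
      end
    end

    # app/models/redirects/graph.rb
    module Redirects
      class Graph < ApplicationRecord
        self.table_name = "redirects_graphs"
        validates :old_signed_id, :new_signed_id, presence: true
      end
    end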

    Every time you delete a graph and create its replacement, we'll need to populate the redirects model with a new entry (we should be able to generate the signed_id from the blob somehow).

    For performance, and to avoid long chains of redirects (which may cause a different error or issue), you'll also have to manage existing redirects when their target changes. That is: say you have a redirect A => B, and you delete B, replacing it with C; you now need A => C and B => C (to avoid an A => B => C redirect), and that chain could otherwise get rather long. This can be handled efficiently by adding an index on new_signed_id and running Redirects::Graph.where(new_signed_id: old_signed_id).update_all(new_signed_id: new_signed_id) to update all the relevant old redirects whenever you regenerate the graph.
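
    Putting those two pieces together, the regeneration hook might look roughly like this (the method name and where you call it from are up to you):

    # Called from the daily job after the new graph/plot is attached and
    # before the old one is destroyed (illustrative sketch).
    def record_plot_redirect(old_graph, new_graph)
      old_signed_id = old_graph.plot.blob.signed_id
      new_signed_id = new_graph.plot.blob.signed_id

      # Collapse existing chains: anything that previously redirected to the
      # old blob now points straight at the new one (A => C and B => C,
      # never A => B => C).
      Redirects::Graph.where(new_signed_id: old_signed_id)
                      .update_all(new_signed_id: new_signed_id)

      # Record the redirect for the blob that is about to disappear.
      Redirects::Graph.create!(old_signed_id: old_signed_id,
                               new_signed_id: new_signed_id)
    end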

    The controller itself is trickier; the cleanest method I can think of is to monkey-patch ActiveStorage::RepresentationsController to add a before_action that does something like this (it may not work as-is; params[:signed_id] and the representations path helper may not be right):

    before_action :redirect_if_needed

    def redirect_if_needed
      # Look up whether the requested (now deleted) blob has a replacement.
      redirect_model = Redirects::Graph.find_by(old_signed_id: params[:signed_id])
      return if redirect_model.blank?

      # Send the bot on to the current representation instead of raising.
      redirect_to rails_activestorage_representations_path(
        signed_id: redirect_model.new_signed_id
      )
    end
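
    One way to wire that patch in would be an initializer along these lines (hedged: the controller class name differs between Rails versions, e.g. ActiveStorage::Representations::RedirectController on newer releases, and the representation route exposes :signed_blob_id and :variation_key params rather than :signed_id, so treat the names below as assumptions to verify against your Rails version):

    # config/initializers/active_storage_redirects.rb
    Rails.application.config.to_prepare do
      ActiveStorage::RepresentationsController.class_eval do
        before_action :redirect_if_needed

        def redirect_if_needed
          redirect_model = Redirects::Graph.find_by(old_signed_id: params[:signed_blob_id])
          return if redirect_model.blank?

          # Redirect to the same representation of the replacement blob.
          redirect_to rails_blob_representation_path(
            signed_blob_id: redirect_model.new_signed_id,
            variation_key: params[:variation_key],
            filename: params[:filename]
          )
        end
      end
    end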
    

    If you have version control set up for your database (i.e. the PaperTrail gem or something similar), you may be able to work out the old_signed_id and new_signed_id with a bit of work and build the redirects for the URLs currently causing errors. Otherwise, sadly, this approach will only prevent future errors, and it may be impossible to get the currently broken URLs working again.

    Ideally, though, you would update the blob itself to hold the new graph rather than deleting and recreating it, but I'm not sure that's possible or practical.