I'm using Rails 3 to build a forum for myself and my friends (I didn't use one out of the box for a variety of reasons), and I'm at the point where I'm trying to implement full text search. Nothing fancy - if someone searches the string "morning", I just want to show a list of all forum threads with posts containing the word "morning". I've been using pg_search for this, but it's slow (5+ seconds) since we're already at 300 forum threads with 200k+ posts, some of which run 4k+ characters each. The search call itself is just the stock multisearch API; walking from each hit back to its thread is my own glue code (the forum_thread association below is from my schema, not from pg_search):
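documents = PgSearch.multisearch("morning")  # returns PgSearch::Document records
# Each document's polymorphic `searchable` is the matching Post or Reply;
# forum_thread is a hypothetical association name standing in for mine.
threads = documents.map { |doc| doc.searchable.forum_thread }.uniq

So I have this migration for multisearch: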
class CreatePgSearchDocuments < ActiveRecord::Migration
  def self.up
    say_with_time("Creating table for pg_search multisearch") do
      create_table :pg_search_documents do |t|
        t.text :content
        t.belongs_to :searchable, :polymorphic => true, :index => true
        t.timestamps null: false
      end

      add_index :pg_search_documents, :content, using: "gin"

      PgSearch::Multisearch.rebuild(Post)
      PgSearch::Multisearch.rebuild(Reply)
    end
  end
end
But it's failing when I run the migration with this error:
PG::ProgramLimitExceeded: ERROR: index row size 3080 exceeds maximum 2712 for index "index_pg_search_documents_on_content"
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
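As a sanity check that this is a PostgreSQL-level limit rather than a pg_search bug, I could reproduce it outside Rails with a throwaway table. (One hedge: this builds PostgreSQL's default b-tree index rather than GIN, but the "1/3 of a buffer page" ceiling the HINT describes is the same kind of per-entry cap.)

CREATE TABLE scratch (content text);
-- ~6.6kB of random hex, so the value won't compress below the limit;
-- roughly one long forum post's worth of text
INSERT INTO scratch
  SELECT string_agg(md5(random()::text), ' ') FROM generate_series(1, 200);
CREATE INDEX ON scratch (content);
-- ERROR:  index row size ... exceeds maximum 2712 for index "scratch_content_idx"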
So far Googling has gotten me the following:
GIN indexes are better than GiST indexes for handling 100,000+ lexemes. That implies to me that a GIN index should have no trouble with posts that are only ~700 words.
I had a guess that this error was about a single oversized value rather than the total length of the document, and I was worried it was caused by the fact that I allow a subset of HTML tags in forum posts. So instead of storing post.content I now store post.sanitized_content, which strips all HTML, replaces punctuation with spaces, and collapses runs of spaces (e.g. "<b>Good morning!</b>" becomes "Good morning "):
ActionView::Base.full_sanitizer.sanitize(content).gsub(/[^\w ]/, ' ').squeeze(" ")
This got the error down to "index row size 2848 exceeds maximum 2712", so it clearly did something, but not enough.
I then sanity-checked that pg_search actually supports dynamic methods like that, and isn't just failing silently. According to the docs, "However, if you call any dynamic methods in :against, the following strategy will be used", so they seem to be handled just fine.
Relevant pieces of my implementation of Post:
class Post < ActiveRecord::Base
  include PgSearch
  multisearchable against: [:subject, :sanitized_content]

  def sanitized_content
    ActionView::Base.full_sanitizer.sanitize(content).gsub(/[^\w ]/, ' ').squeeze(" ")
  end
end
(I have also tried removing :subject from the multisearchable against: array, in case it was an unsanitized subject causing the issues; this got me down to "index row size 2800" in the error, but didn't fix it.)
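For what it's worth, my reading of the "strategy" mentioned in the docs is that rebuild falls back to loading every record and regenerating its search document one at a time - roughly the following (my paraphrase, not pg_search's literal source; update_pg_search_document is the method multisearchable adds):

Post.find_each do |post|
  post.update_pg_search_document  # regenerate this record's pg_search_documents row
end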
So... what am I missing? Shouldn't GIN indexes be able to handle large text documents? Do I need to turn my documents into tsvectors first, like in this answer? The HINT keeps suggesting "full text indexing", but I thought that's what this was.
For the sake of future people Googling: tentatively, using
execute "CREATE INDEX idx_fts_search_content ON pg_search_documents USING gin(to_tsvector('english', content))"
instead of
add_index :pg_search_documents, :content, using: "gin"
has fixed it. The index isn't doing much so far - searching everything still takes a good 8.1s - but at least the migration runs now!
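Written out as a full migration, the tentative fix looks roughly like this (a sketch in the same style as the migration above; the index name is mine, and down is just the matching DROP):

class AddFtsIndexToPgSearchDocuments < ActiveRecord::Migration
  def self.up
    # Index the tsvector expression instead of the raw text column,
    # so individual index entries stay small.
    execute "CREATE INDEX idx_fts_search_content ON pg_search_documents USING gin(to_tsvector('english', content))"
  end

  def self.down
    execute "DROP INDEX idx_fts_search_content"
  end
end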
EDIT: Missed an important thing. The actual command should be:
execute "CREATE INDEX idx_fts_post_content ON posts USING gin(to_tsvector('english', coalesce(\"posts\".\"content\"::text, '')))"
If you don't have the coalesce(), PostgreSQL won't use the index: an expression index is only considered when a query contains the exact same expression, and the tsvector pg_search generates is wrapped in that coalesce().
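You can check which plan the planner actually picks with EXPLAIN; the condition below is me hand-writing the kind of expression pg_search generates, so treat it as illustrative. With the coalesce() in both the index and the query, you should see a Bitmap Index Scan on idx_fts_post_content instead of a Seq Scan on posts.

EXPLAIN SELECT * FROM posts
WHERE to_tsvector('english', coalesce("posts"."content"::text, ''))
      @@ plainto_tsquery('english', 'morning');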