Search code examples
sqldatabaseindexingsqlalchemyclustered-index

Improve speed of query in a many-to-many relationship


In an attempt to teach myself how to program, I'm making a web app (Flask, SQLAlchemy, Jijna) to display all the books I've ever ordered from Amazon.

In the "barest bones" possible way, I'm trying to learn how to replicate http://pinboard.in—that's my paragon. I have no idea how his site runs so fast: I can load 160 bookmark entries—all with associated tags—in, I dunno, 500 ms? ... which is why I know I am doing something wrong, as discussed below.

In any case, I created a many-to-many relationship between my books Class and my tag Class such that a user can (1) click on a book and see all its tags, as well as (2) click on a tag and see all associated books. Here is my table architecture:

Entity relationship diagram

Here is the code for the relationship between the two Classes:

assoc = db.Table('assoc',
    db.Column('book_id', db.Integer, db.ForeignKey('books.book_id')),
    db.Column('tag_id', db.Integer, db.ForeignKey('tags.tag_id'))
)

class Book(db.Model):
    __tablename__ = 'books'
    book_id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(120), unique=True)
    auth = db.Column(db.String(120), unique=True)
    comment = db.Column(db.String(120), unique=True)
    date_read = db.Column(db.DateTime)
    era = db.Column(db.String(36))
    url = db.Column(db.String(120))
    notable = db.Column(db.String(1))

    tagged = db.relationship('Tag', secondary=assoc, backref=db.backref('thebooks',lazy='dynamic'))

    def __init__(self, title, auth, comment, date_read, url, notable):
        self.title = title
        self.auth = auth
        self.comment = comment
        self.date_read = date_read
        self.era = era
        self.url = url
        self.notable = notable
    
class Tag(db.Model):
    __tablename__ = 'tags'
    tag_id = db.Column(db.Integer, primary_key=True)
    tag_name = db.Column(db.String(120))

the problem

If I iterate through the books table only (~400 rows), the query runs and renders to the browser in lightning speed. No problem there.

{% for i in book_query %}
    <li>
      {{i.notable}}{{i.notable}}
      <a href="{{i.url}}">{{i.title}}</a>, {{i.auth}}
      <a href="/era/{{i.era}}">{{i.era}}</a> {{i.date_read}}
        {% if i.comment %}
          <p>{{i.comment}}</p>
        {% else %}
          <!-- print nothing -->
        {% endif %}
    </li>
{% endfor %}

If, however, I want to show any and all tags associated with a book, I change the code by nesting a for loop as follows:

{% for i in book_query %}
    <li>
      {{i.notable}}{{i.notable}}
      <a href="{{i.url}}">{{i.title}}</a>, {{i.auth}}
      <a href="/era/{{i.era}}">{{i.era}}</a>
        {% for ii in i.tagged %}
            <a href="/tag/{{ii.tag_name}}">{{ii.tag_name}}</a>
        {% endfor %}
      {{i.date_read}}
        {% if i.comment %}
          <p>{{i.comment}}</p>
        {% else %}
          <!-- print nothing -->
        {% endif %}
    </li>
  {% endfor %}

The query slows down significantly (takes about 20 seconds). My understanding is that this is happening because for every row in the book table, my code is iterating through the entire assoc table (i.e., "full table scan").

discussion (or, "what i think is happening")

Obviously, I am a complete noob—I've been programming for ~3 months. It's motivating just to get things working, but I realize I have large gaps in my knowledge base that I am trying to fill as I go along.

Right off that bat, I can appreciate that it's incredibly inefficient that, with each new book, the code is iterating through the entire association table (if that's indeed what's happening, which I believe it is). I think I need to cluster(?) or sort(?) the assoc table in such a way that once I retrieve all tags for book with book_id == 1, I never again "check" the rows with book_id == 1 in the assoc table.

In other words, what I think is happening is this (in computerspeak):

  • Oh, he wants to know how the book with book_id == 1 in books table has been tagged
  • Okay, let me go to the assoc table
  • Row #1 ... Is book_id in assoc table equal to 1?
  • Okay, it is; then what is the tag_id for Row #1? ... [then computer goes to tag table to get tag_name, and returns it to the browser]
  • Row #2 ... is book_id in assoc table equal to 1?
  • Oh, no, it isn't ... okay, go to Row #3
  • Hmmmm, because my programmer is stupid and didn't make this table sorted or indexed in some way, I'm going to have to go through the entire assoc table looking for book_id == 1 when perhaps there aren't any more ...

Then, once we get to book_id == 2 in the books table the computer gets really mad:

  • Okay, he wants to know all the tags that go with book_id == 2
  • Okay, let me go to the assoc table
  • Row #1 ... wait a second ... didn't I check this one already?? Holy sh#t, I have to do this all over again??
  • Dammit ... okay ... Row #1 ... is book_id == 2? (I know it isn't! But I have to check anyway because my programmer is a dum-dum ...)

questions

So the question is, can I (1) sort(?) or cluster(?) the assoc table in some way that ensures more "intelligent" traversal through the assoc table, or, as a friend of mine suggested, do I (2) "learn to write good SQL queries"? (Note, I've never learned SQL since I've been handling everything with SQLAlchemy ... damn Alchemists ... enshrouding their magics in secret and whatnot.)

final words

Thanks for any input. If you have any suggestions to help me improve how I ask questions on stackoverflow (this is my first post!) please let me know.


Solution

  • Most of the answer is in the question.

    In the first example 1 SQL query is executed when you iterate through books table. In the second example a separate assoc query is executed for every Book. So it is about 400 SQL queries which are quite time consuming. You can view them in your app debug log if you set SQLALCHEMY_ECHO config parameter:

    app.config['SQLALCHEMY_ECHO'] = True
    

    Or you can install Flask-DebugToolbar and look at these queries in web interface.

    The best approach to handle this problem is to learn SQL basics, you will need them anyway when your applications grow larger. Try to write a more optimized query in pure SQL. For your case it may look like this:

    SELECT books.*, tags.tag_name FROM books
    JOIN assoc ON assoc.book_id = books.book_id
    JOIN tags ON assoc.tag_id = tags.tag_id
    

    Then try to rewrite it in SQLAlchemy code and then group by book before passing to HTML renderer:

    # Single query to get all books and their tags
    query = db.session.query(Book, Tag.tag_name).join('tagged')
    # Dictionary of data to be passed to renderer
    books = {}
    for book, tag_name in query:
        book_data = books.setdefault(book.book_id, {'book': book, 'tags': []})
        book_data['tags'].append(tag_name)
    # Rendering HTML
    return render_template('yourtemplate.html', books=books)
    

    Template code will look like this:

    {% for book in books %}
    <li>
      {{ book.book.notable }}{{ book.book.notable }}
      <a href="{{ book.book.url }}">{{ book.book.title }}</a>, {{ book.book.auth }}
      <a href="/era/{{ book.book.era }}">{{ book.book.era }}</a>
      {% for tag in book.tags %}
        &nbsp;<a href="/tag/{{ tag }}" class="tag-link">{{ tag }}</a>&nbsp;
      {% endfor %}
      {{ book.book.date_read }}
        {% if book.book.comment %}
          <p>{{ book.book.comment }}</p>
        {% else %}
          <!-- print nothing -->
        {% endif %}
    </li>
    {% endfor %}
    

    Another approach

    If your database is PostgreSQL you can write such query:

    SELECT books.title, books.auth (...), array_agg(tags.tag_name) as book_tags FROM books
    JOIN assoc ON assoc.book_id = books.book_id
    JOIN tags ON assoc.tag_id = tags.tag_id
    GROUP BY books.title, books.auth (...)
    

    In this case you will get books data with already aggregated tags as an array. SQLAlchemy allows you to make such query:

    from sqlalchemy import func
    
    books = db.session.query(Book, func.array_agg(Tag.tag_name)).\
        join('tagged').group_by(Book).all()
    return render_template('yourtemplate.html', books=books)
    

    And template has the following structure:

    {% for book, tags in books %}
    <li>
      {{ book.notable }}{{ book.notable }}
      <a href="{{ book.url }}">{{ book.title }}</a>, {{ book.auth }}
      <a href="/era/{{ book.era }}">{{ book.era }}</a>
      {% for tag in tags %}
        &nbsp;<a href="/tag/{{ tag }}" class="tag-link">{{ tag }}</a>&nbsp;
      {% endfor %}
      {{ book.date_read }}
        {% if book.comment %}
          <p>{{ book.comment }}</p>
        {% else %}
          <!-- print nothing -->
        {% endif %}
    </li>
    {% endfor %}