In an attempt to teach myself how to program, I'm making a web app (Flask, SQLAlchemy, Jijna) to display all the books I've ever ordered from Amazon.
In the "barest bones" possible way, I'm trying to learn how to replicate http://pinboard.in—that's my paragon. I have no idea how his site runs so fast: I can load 160 bookmark entries—all with associated tags—in, I dunno, 500 ms? ... which is why I know I am doing something wrong, as discussed below.
In any case, I created a many-to-many relationship between my books
Class and my tag
Class such that a user can (1) click on a book
and see all its tags
, as well as (2) click on a tag
and see all associated books. Here is my table architecture:
Here is the code for the relationship between the two Classes:
assoc = db.Table('assoc',
db.Column('book_id', db.Integer, db.ForeignKey('books.book_id')),
db.Column('tag_id', db.Integer, db.ForeignKey('tags.tag_id'))
)
class Book(db.Model):
__tablename__ = 'books'
book_id = db.Column(db.Integer, primary_key=True)
title = db.Column(db.String(120), unique=True)
auth = db.Column(db.String(120), unique=True)
comment = db.Column(db.String(120), unique=True)
date_read = db.Column(db.DateTime)
era = db.Column(db.String(36))
url = db.Column(db.String(120))
notable = db.Column(db.String(1))
tagged = db.relationship('Tag', secondary=assoc, backref=db.backref('thebooks',lazy='dynamic'))
def __init__(self, title, auth, comment, date_read, url, notable):
self.title = title
self.auth = auth
self.comment = comment
self.date_read = date_read
self.era = era
self.url = url
self.notable = notable
class Tag(db.Model):
__tablename__ = 'tags'
tag_id = db.Column(db.Integer, primary_key=True)
tag_name = db.Column(db.String(120))
If I iterate through the books
table only (~400 rows), the query runs and renders to the browser in lightning speed. No problem there.
{% for i in book_query %}
<li>
{{i.notable}}{{i.notable}}
<a href="{{i.url}}">{{i.title}}</a>, {{i.auth}}
<a href="/era/{{i.era}}">{{i.era}}</a> {{i.date_read}}
{% if i.comment %}
<p>{{i.comment}}</p>
{% else %}
<!-- print nothing -->
{% endif %}
</li>
{% endfor %}
If, however, I want to show any and all tags associated with a book, I change the code by nesting a for loop
as follows:
{% for i in book_query %}
<li>
{{i.notable}}{{i.notable}}
<a href="{{i.url}}">{{i.title}}</a>, {{i.auth}}
<a href="/era/{{i.era}}">{{i.era}}</a>
{% for ii in i.tagged %}
<a href="/tag/{{ii.tag_name}}">{{ii.tag_name}}</a>
{% endfor %}
{{i.date_read}}
{% if i.comment %}
<p>{{i.comment}}</p>
{% else %}
<!-- print nothing -->
{% endif %}
</li>
{% endfor %}
The query slows down significantly (takes about 20 seconds). My understanding is that this is happening because for every row in the book
table, my code is iterating through the entire assoc
table (i.e., "full table scan").
Obviously, I am a complete noob—I've been programming for ~3 months. It's motivating just to get things working, but I realize I have large gaps in my knowledge base that I am trying to fill as I go along.
Right off that bat, I can appreciate that it's incredibly inefficient that, with each new book, the code is iterating through the entire association table (if that's indeed what's happening, which I believe it is). I think I need to cluster(?) or sort(?) the assoc
table in such a way that once I retrieve all tags for book with book_id == 1
, I never again "check" the rows with book_id == 1
in the assoc
table.
In other words, what I think is happening is this (in computerspeak):
book_id == 1
in books
table has been taggedassoc
tablebook_id
in assoc
table equal to 1
?tag_id
for Row #1? ... [then computer goes to tag
table to get tag_name
, and returns it to the browser]book_id
in assoc
table equal to 1
?assoc
table looking for book_id == 1
when perhaps there aren't any more ...Then, once we get to book_id == 2
in the books table
the computer gets really mad:
book_id == 2
assoc
tablebook_id == 2
? (I know it isn't! But I have to check anyway because my programmer is a dum-dum ...)So the question is, can I (1) sort(?) or cluster(?) the assoc
table in some way that ensures more "intelligent" traversal through the assoc
table, or, as a friend of mine suggested, do I (2) "learn to write good SQL queries"? (Note, I've never learned SQL since I've been handling everything with SQLAlchemy ... damn Alchemists ... enshrouding their magics in secret and whatnot.)
Thanks for any input. If you have any suggestions to help me improve how I ask questions on stackoverflow (this is my first post!) please let me know.
Most of the answer is in the question.
In the first example 1 SQL query is executed when you iterate through books
table. In the second example a separate assoc
query is executed for every Book
. So it is about 400 SQL queries which are quite time consuming. You can view them in your app debug log if you set SQLALCHEMY_ECHO config parameter:
app.config['SQLALCHEMY_ECHO'] = True
Or you can install Flask-DebugToolbar and look at these queries in web interface.
The best approach to handle this problem is to learn SQL basics, you will need them anyway when your applications grow larger. Try to write a more optimized query in pure SQL. For your case it may look like this:
SELECT books.*, tags.tag_name FROM books
JOIN assoc ON assoc.book_id = books.book_id
JOIN tags ON assoc.tag_id = tags.tag_id
Then try to rewrite it in SQLAlchemy code and then group by book before passing to HTML renderer:
# Single query to get all books and their tags
query = db.session.query(Book, Tag.tag_name).join('tagged')
# Dictionary of data to be passed to renderer
books = {}
for book, tag_name in query:
book_data = books.setdefault(book.book_id, {'book': book, 'tags': []})
book_data['tags'].append(tag_name)
# Rendering HTML
return render_template('yourtemplate.html', books=books)
Template code will look like this:
{% for book in books %}
<li>
{{ book.book.notable }}{{ book.book.notable }}
<a href="{{ book.book.url }}">{{ book.book.title }}</a>, {{ book.book.auth }}
<a href="/era/{{ book.book.era }}">{{ book.book.era }}</a>
{% for tag in book.tags %}
<a href="/tag/{{ tag }}" class="tag-link">{{ tag }}</a>
{% endfor %}
{{ book.book.date_read }}
{% if book.book.comment %}
<p>{{ book.book.comment }}</p>
{% else %}
<!-- print nothing -->
{% endif %}
</li>
{% endfor %}
Another approach
If your database is PostgreSQL you can write such query:
SELECT books.title, books.auth (...), array_agg(tags.tag_name) as book_tags FROM books
JOIN assoc ON assoc.book_id = books.book_id
JOIN tags ON assoc.tag_id = tags.tag_id
GROUP BY books.title, books.auth (...)
In this case you will get books data with already aggregated tags as an array. SQLAlchemy allows you to make such query:
from sqlalchemy import func
books = db.session.query(Book, func.array_agg(Tag.tag_name)).\
join('tagged').group_by(Book).all()
return render_template('yourtemplate.html', books=books)
And template has the following structure:
{% for book, tags in books %}
<li>
{{ book.notable }}{{ book.notable }}
<a href="{{ book.url }}">{{ book.title }}</a>, {{ book.auth }}
<a href="/era/{{ book.era }}">{{ book.era }}</a>
{% for tag in tags %}
<a href="/tag/{{ tag }}" class="tag-link">{{ tag }}</a>
{% endfor %}
{{ book.date_read }}
{% if book.comment %}
<p>{{ book.comment }}</p>
{% else %}
<!-- print nothing -->
{% endif %}
</li>
{% endfor %}