jekyll, github-pages, code-generation, static-site

Changes-based regeneration by static site generators


It seems like all of the static site generators I have found completely regenerate the entire site every time a change is made to some file in the site.

For example, one of the more popular site generators in use is Jekyll, which powers GitHub Pages. Every time an author makes a change (say a grammar correction in a post file, or a change to the about.html layout) and needs to regenerate that content, Jekyll gives no choice other than to regenerate the entire site, even if there are hundreds of files whose output is unchanged by the recent edits.

The time it takes to regenerate large sites seems to be a common complaint against most static site generators.

Is there any technical reason (from the POV of development or engineering of static site generators) that prevents someone from writing a static site generator that is "intelligent" about its contents, to the point that it could understand which files were changed and which files depend on them (or vice versa), and would only regenerate the necessary files?

Since most people (especially Jekyll/GH Pages users) are storing their sites in a git repository, it even seems like a site generator could make use of the commit information to track changes, and rely on that information to know which files need to be regenerated and which can be left alone. Thoughts?


Solution

  • Short answer: it's hard.

    The hard part isn't knowing which files changed. The hard part is knowing which output files are affected by the files that changed. For example, if you change the title of a blog post, the main blog index will need to be updated. So will any tag pages. So will any page which lists that post as a "related post". If you have excerpts on your homepage, same deal.

    But that's not impossible to deal with. You can keep a directed acyclic graph which tracks the dependencies for any given page, and regenerate the pages which include bits of other pages that change. It adds overhead and code complexity, as well as computation time, but doing this would probably be worth the effort.

    Harder than that, though, is knowing which pages need to be regenerated as a result of changes to items they're not already associated with. What happens if you add a new tag to a blog post? Now the tag page for that new tag needs to be regenerated as well. If you're using tags to generate "related posts", all posts on your site should be regenerated, since the "best" relations for any given post could be different now. What happens when you add a new post? To avoid recompiling the whole site, the static site generator must know which pages would have included that post had it already existed, and regenerate them as well.

    Note that, in all these cases, false positives (pages which haven't changed, but are recompiled anyway) are acceptable, but false negatives (pages which should be recompiled, but are not) are absolutely unacceptable. So in every case, the site generator must err on the side of caution: if there's any possibility that a page would change were it compiled again, it must be recompiled.

    Nanoc, for example, does track changes like you mention. It keeps a directed acyclic graph of pages that depend on other pages, and it caches it between compiles to limit the number of recompiles. It doesn't regenerate every page every time, but it does often recompile some pages which don't need to be recompiled. There's still a lot of room to improve.
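
To make the dependency-graph idea concrete, here is a minimal sketch in Python. It is not how nanoc or Jekyll actually implement this; `DependencyGraph`, `touched_nodes`, the `collection:posts` and `tag:...` node names, and all file paths are made up for illustration. Listing pages depend on a collection node rather than on individual posts, so adding a post or a tag can still reach them, and `touched_nodes` deliberately over-approximates so that mistakes come out as false positives rather than false negatives.

```python
from collections import defaultdict


class DependencyGraph:
    """Maps each input node to the outputs that read it during the last build."""

    def __init__(self):
        self._dependents = defaultdict(set)

    def record(self, output, *inputs):
        """Note that `output` was rendered using each of `inputs`."""
        for node in inputs:
            self._dependents[node].add(output)

    def outdated(self, changed):
        """Return every recorded output reachable from the changed nodes."""
        dirty, stack = set(), list(changed)
        while stack:
            node = stack.pop()
            for output in self._dependents.get(node, ()):
                if output not in dirty:
                    dirty.add(output)
                    stack.append(output)  # an output may feed other outputs
        return dirty


def touched_nodes(path, old_tags, new_tags):
    """Over-approximate what editing or adding a post touches (hypothetical rule)."""
    nodes = {path, "collection:posts"}
    # Both the old and the new tags count: a tag page may gain or lose the post.
    nodes.update(f"tag:{tag}" for tag in set(old_tags) | set(new_tags))
    return nodes


# Dependencies as they might be recorded during a full build (made-up paths).
graph = DependencyGraph()
graph.record("out/a.html", "posts/a.md", "layouts/post.html")
graph.record("out/b.html", "posts/b.md", "layouts/post.html")
graph.record("out/about.html", "about.md", "layouts/page.html")
graph.record("out/index.html", "collection:posts", "layouts/page.html")
graph.record("out/tags/python.html", "tag:python", "layouts/page.html")

# Editing the about page only dirties the about page.
about_edit = graph.outdated({"about.md"})

# Adding the tag "rust" to post a dirties the post, the index and its tag pages.
retag = graph.outdated(touched_nodes("posts/a.md", ["python"], ["python", "rust"]))

# Changing the post layout dirties every post, but not the index or about page.
layout_edit = graph.outdated({"layouts/post.html"})
```

In this sketch the page for the brand-new `rust` tag is not in `retag`, because nothing was recorded for it yet. A generator built this way would also have to build any output it has no record of, and fall back to a full rebuild whenever the recorded graph is missing or cannot be trusted.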