Robustly generate anchors in Markdown

I have some Ruby code to auto-generate tables of contents in GitHub Flavoured Markdown. I would also like to understand other flavours of Markdown, if they differ in ways relevant to this problem.

At the moment, I have this code that works 99% of the time:

  def header_to_anchor
    @header
      .downcase
      .gsub(/[^a-z\d\- ]+/, "")
      .gsub(/ /, "-")
  end

This is based on a note I found in a GitHub comment here. It reads:

The code that creates the anchors is here: https://github.com/jch/html-pipeline/blob/master/lib/html/pipeline/toc_filter.rb

It downcases the string

remove anything that is not a letter, number, space or hyphen (see the source for how Unicode is handled)

changes any space to a hyphen.

If that is not unique, add "-1", "-2", "-3",... to make it unique

For my purposes, I don't need to solve the uniqueness problem.

This worked well until I found an edge case where it fails. I have a heading in a Markdown document:

### shunit2/_shared.sh

And my code generates an anchor that is:

* [shunit2/_shared.sh](#shunit2sharedsh)

This produces yet another broken link, at least as far as GitHub Flavoured Markdown is concerned.

I've also seen this answer here, but the rules given there also appear not to be quite robust.

Does anyone know of authoritative documentation that explains the rules for generating these anchors?

Solution

The confusion here appears¹ to be that the Ruby regex in the code linked from the GitHub comment does something slightly different from what the comment says. The code uses this regex:

PUNCTUATION_REGEXP = RUBY_VERSION > '1.9' ? /[^\p{Word}\- ]/u : /[^\w\- ]/

This is the pattern used to delete "punctuation". Ruby regexes are documented here.

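In Ruby, \p{Word} matches Unicode word characters: letters, marks, numbers and connector punctuation, which includes the underscore. So the negated class [^\p{Word}\- ] leaves underscores in place.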
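To see the difference on the heading from the question, here is a small comparison of the two character classes. The constant names ASCII_ONLY and WORD_AWARE are made up for this illustration; only the patterns themselves come from the question and from toc_filter.rb.

```ruby
# Pattern from the question: keeps only ASCII letters, digits, hyphens and spaces.
ASCII_ONLY = /[^a-z\d\- ]+/
# Pattern from toc_filter.rb (the Ruby 1.9+ branch): keeps Unicode word
# characters (which include the underscore), hyphens and spaces.
WORD_AWARE = /[^\p{Word}\- ]/u

heading = "shunit2/_shared.sh"

ascii_result = heading.downcase.gsub(ASCII_ONLY, "")
word_result  = heading.downcase.gsub(WORD_AWARE, "")

puts ascii_result  # shunit2sharedsh  (underscore stripped)
puts word_result   # shunit2_sharedsh (underscore kept)
```

Only the second result keeps the underscore, which is the character the question's code was dropping.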

So, the comment in the GitHub issue, "remove anything that is not a letter, number, space or hyphen (see the source for how Unicode is handled)" is a misreading of the code.

The correct rules should be:

Downcase the string

Remove anything that is not a letter, number, space, underscore or hyphen (see the source for how Unicode is handled)

Change any space to a hyphen.

If that is not unique, add "-1", "-2", "-3",... to make it unique
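A minimal sketch of these corrected rules follows, assuming Ruby 1.9 or later. The method signature and the `seen` counter hash are my own choices for illustration, not taken from toc_filter.rb, and the uniqueness step is a simple per-anchor counter that does not guard against a heading whose own text already ends in "-1".

```ruby
# Same character class as the Ruby 1.9+ branch quoted above.
PUNCTUATION_REGEXP = /[^\p{Word}\- ]/u

# header: the heading text without the leading "#" markers.
# seen:   hypothetical counter hash; share one instance across a whole
#         document so that repeated headings get "-1", "-2", ... suffixes.
def header_to_anchor(header, seen = Hash.new(0))
  anchor = header
    .downcase                        # rule 1: downcase
    .gsub(PUNCTUATION_REGEXP, "")    # rule 2: drop everything but word chars, hyphen, space
    .gsub(" ", "-")                  # rule 3: spaces become hyphens

  count = seen[anchor]               # rule 4: suffix repeats to keep anchors unique
  seen[anchor] += 1
  count.zero? ? anchor : "#{anchor}-#{count}"
end

seen = Hash.new(0)
puts header_to_anchor("shunit2/_shared.sh", seen)  # shunit2_sharedsh
puts header_to_anchor("shunit2/_shared.sh", seen)  # shunit2_sharedsh-1
```

If uniqueness does not matter, as in the question, calling the method without the second argument gives each call a fresh counter and never adds a suffix.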

¹ Assuming, of course, that the toc_filter.rb file mentioned in the GitHub issue really is the "source of truth" rather than an implementation of rules defined elsewhere.