Does anyone know the proper, official algorithm that GitHub.com uses to encode fragment_id links for inner headers?
(I hope this is no longer considered too broad a question.)
I reverse-engineered how GitHub Flavored Markdown formats links to content headers. The result looks quite odd, so I suspect I did something wrong. Maybe you have an idea how to improve it (apart from chaining, which I skipped here to keep the steps readable).
First of all, I found that a string such as
1.2.3-a Łukasz_testing? header `special characters`;.,links How+they%20 behave
is encoded there as
123-a-%C5%81ukasz_testing-header-special-characterslinks-howthey20-behave
I recreated the same result with:
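    function(string) {
        // Lowercase ASCII letters only; non-ASCII capitals such as "Ł" are left untouched.
        string = string.replace(/[A-Z]+/g, function(v) { return v.toLowerCase(); });
        // Drop everything except word characters, hyphens, whitespace and the listed Unicode ranges.
        string = string.replace(/[^a-z0-9-\s\u00BF-\u1FFF\u2C00-\uD7FF\w]+/g, '');
        // Collapse each run of whitespace into a single hyphen.
        string = string.replace(/[\s\t ]+/g, '-');
        // Percent-encode whatever non-ASCII characters remain.
        string = encodeURIComponent(string);
        return string;
    }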
But it looks quite clunky. Any ideas how close it is to the original?
I agree with @elclanrs, chaining looks more concise:
    function(string) {
        return encodeURIComponent(
            string
                .replace(/[A-Z]+/g, function(v) { return v.toLowerCase(); })
                .replace(/[^a-z0-9-\s\u00BF-\u1FFF\u2C00-\uD7FF\w]+/g, '')
                .replace(/[\s\t ]+/g, '-')
        );
    }
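To check a version like this against the sample from the question, a small harness along the following lines may help. This is only a sketch: the name `toFragmentId` is hypothetical (the snippets in the question are anonymous), the character class is slightly tidied on the assumption that `\w` and `\s` already cover the ranges they replace, and matching one sample string does not show that this is what GitHub actually does.

```javascript
// Hypothetical name; the original snippets are anonymous functions.
function toFragmentId(heading) {
    return encodeURIComponent(
        heading
            // Lowercase ASCII letters only. Calling toLowerCase() on the whole
            // string would also turn "Ł" into "ł", which the sample does not show.
            .replace(/[A-Z]+/g, function (m) { return m.toLowerCase(); })
            // \w already covers a-z, 0-9 and "_", so those ranges are omitted here.
            .replace(/[^\w\s\u00BF-\u1FFF\u2C00-\uD7FF-]+/g, '')
            // \s already matches tabs and plain spaces.
            .replace(/\s+/g, '-')
    );
}

// Sample input and observed output, copied from the question.
var input = '1.2.3-a Łukasz_testing? header `special characters`;.,links How+they%20 behave';
var expected = '123-a-%C5%81ukasz_testing-header-special-characterslinks-howthey20-behave';

console.log(toFragmentId(input) === expected); // true
```

Keeping the observed input and output next to the function makes it easy to add more headings copied from a rendered page and see where the guess diverges.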
I wouldn't look for a "wiseass" implementation (such as "one regex to rule them all"). This implementation is simple and readable, which makes it easy to maintain.