Tags: javascript, regex, github, utf-8, encode

GitHub.com heading links - JavaScript


Does anyone know the proper, official algorithm that GitHub.com uses to encode fragment identifiers (anchor links) for headings inside a document?

(I hope this is no longer considered too broad a question.)

I reverse-engineered how GitHub Flavored Markdown turns headings into links. The result looks rather odd, so I suspect I got something wrong. Maybe you have an idea how to improve it (apart from chaining, which I skipped here to keep the individual steps readable).

First of all, I found that a heading such as

    1.2.3-a Łukasz_testing? header `special characters`;.,links How+they%20 behave

is encoded as

    123-a-%C5%81ukasz_testing-header-special-characterslinks-howthey20-behave

I reproduced the same result with:

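function githubAnchor(string) {
    // Lowercase ASCII letters only; non-ASCII letters such as Ł are left as-is.
    string = string.replace(/[A-Z]+/g, function(v) { return v.toLowerCase(); });
    // Drop everything except word characters, hyphens, whitespace and the listed Unicode ranges.
    string = string.replace(/[^a-z0-9-\s\u00BF-\u1FFF\u2C00-\uD7FF\w]+/g, '');
    // Replace each run of whitespace with a single hyphen.
    string = string.replace(/[\s\t ]+/g, '-');
    // Percent-encode what remains (for example, Ł becomes %C5%81).
    string = encodeURIComponent(string);
    return string;
}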

But it looks quite clunky. Any idea how close it is to the original?
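To see what each step contributes, the intermediate values for the sample heading can be printed. This is only a sketch: `traceAnchor` is a hypothetical helper name, the regexes are copied unchanged from the function above, and nothing here is confirmed to be what GitHub does internally.

```javascript
// Hypothetical helper: returns the string as it looks after each step.
function traceAnchor(heading) {
    // Step 1: lowercase ASCII letters only.
    const lowered = heading.replace(/[A-Z]+/g, (v) => v.toLowerCase());
    // Step 2: strip punctuation and other characters outside the allowed set.
    const stripped = lowered.replace(/[^a-z0-9-\s\u00BF-\u1FFF\u2C00-\uD7FF\w]+/g, '');
    // Step 3: turn each run of whitespace into one hyphen.
    const dashed = stripped.replace(/[\s\t ]+/g, '-');
    // Step 4: percent-encode the remaining non-ASCII characters.
    const encoded = encodeURIComponent(dashed);
    return { lowered, stripped, dashed, encoded };
}

const sample = '1.2.3-a Łukasz_testing? header `special characters`;.,links How+they%20 behave';
const steps = traceAnchor(sample);

console.log(steps.lowered);
console.log(steps.stripped); // 123-a Łukasz_testing header special characterslinks howthey20 behave
console.log(steps.dashed);   // 123-a-Łukasz_testing-header-special-characterslinks-howthey20-behave
console.log(steps.encoded);  // 123-a-%C5%81ukasz_testing-header-special-characterslinks-howthey20-behave
```

Note that the dots, the question mark, the backticks, `;.,`, `+` and `%` all disappear in step 2, which is why `characters` and `links` end up joined.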


Solution

  • I agree with @elclanrs: chaining looks more concise.

    function githubAnchor(string) {
        return encodeURIComponent(string.replace(/[A-Z]+/g, function(v) { return v.toLowerCase(); })
                                        .replace(/[^a-z0-9-\s\u00BF-\u1FFF\u2C00-\uD7FF\w]+/g, '')
                                        .replace(/[\s\t ]+/g, '-'));
    }
    

    I wouldn't look for a "wiseass" implementation (such as "one regex to rule them all"). This implementation is simple and readable, which makes it easy to maintain.
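One detail worth keeping when tidying this up: the `[A-Z]` replace lowercases ASCII letters only, and the sample output quoted in the question depends on that (`%C5%81` is the uppercase `Ł`). The sketch below compares it with a plain `toLowerCase()` call. `slugify`, `asciiLower` and `fullLower` are hypothetical names used only for this comparison, and the check is against the question's sample, not against anything GitHub documents.

```javascript
// Hypothetical helper: same pipeline as above, with the lowercasing step swappable.
function slugify(heading, lower) {
    return encodeURIComponent(lower(heading)
        .replace(/[^a-z0-9-\s\u00BF-\u1FFF\u2C00-\uD7FF\w]+/g, '')
        // \s already matches tabs and spaces, so it is equivalent to [\s\t ].
        .replace(/\s+/g, '-'));
}

// Lowercases A-Z only, as in the original function.
const asciiLower = (s) => s.replace(/[A-Z]+/g, (v) => v.toLowerCase());
// Lowercases every cased letter, including non-ASCII ones.
const fullLower = (s) => s.toLowerCase();

console.log(slugify('Łukasz Testing', asciiLower)); // %C5%81ukasz-testing
console.log(slugify('Łukasz Testing', fullLower));  // %C5%82ukasz-testing
```

With `toLowerCase()` the `Ł` becomes `ł`, so the encoded anchor differs from the one observed in the question.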