Search code examples
htmlregexmarkdownjekyll

Markdown/Jekyll: Auto-surround bare URLs with angle brackets?


I notice that some Markdown parsers and GitHub will auto-convert bare URLs to links, but others (like Kramdown) don't. The standard Markdown syntax requires that URLs be wrapped in angle brackets, e.g. <https://www.google.com/>.

I have a number of documents with bare URLs that appear as desired, i.e. as hyperlinks, in my Markdown editor but are not getting rendered as links when I push them in Jekyll to GitHub Pages.

How can I write a script to surround bare URLs with angle brackets? Preferably via shell scripting, standard command line tools (sed, awk) or Python. Or perhaps there's already a Jekyll plugin for this?

I know that matching URLs is highly nontrivial, so wanted to ask here on SO before getting too deep into this.

Further difficulty: The solution should only change bare URLs, and leave alone URLs that have already been wrapped/encoded via standards-compliant Markdown or HTML.

(I expected this to be a common question, and it is in various GitHub-Issues posts for various packages, with no solutions... But tried searching for this question here and couldn't find it already asked, nor any premade Jekyll solutions. I found many questions about matching when the angle brackets are already there, but not ones to add the angle brackets. Yet I'm imagining the solution has been implemented many, many times -- in the very tools we use, such as GitHub and MathOverflow -- so, not sure why the means to do this isn't widely posted.)


Solution

  • You may try the below regex:

    (?!<)^(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&\/\/=]*))(?!>)$
    

    Explanation of the above regex:

    • (?!<) - Represents negative look-ahead not matching the string if it starts with a <.

    • ^, $ - Represents start and end of line respectively.

    • (https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&\/\/=]*)) - This part matches all the possible valid urls efficiently.

    • (?!>) - Represents negative look-ahead not matching if the url ends in >.

    Pictorial Representation

    You can find the demo of the above regex in here.


    NOTE: I also prefer using perl command if it comes to implementation in bash. But if it is your necessary requirement to use sed then you can try the below command. However; please be noted that sed misses many amazing features of regex namely; look-arounds, non-captured groups, etc.

    sed -E 's@^[^<]?(https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&/=]*))[^>]?$@<\1>@gm'
    

    You can find sample run of the perl and sed implementation in here.