
How to normalize a URL?


I am dealing with a situation where I need users to enter various URLs (for example: for their profiles). However, users do not always insert URLs in the https://example.com format. They might insert something like:

  • example.com
  • example.com/
  • example.com/somepage
  • but something like [email protected], or anything else that is not a URL, should not be accepted

How can I normalize the URLs to a format that can potentially lead to a web address? I see this behavior in web browsers. We almost always enter crappy things in a web browser's bar and they can distinguish whether that's a search or something that can be turned into a URL.

I tried looking in many places, but it seems I can't find any approach to this.

I would prefer a solution written for Node if it's possible. Thank you very much!


Solution

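  • Use Node's URL API, alongside some manual checks.

    1. Manually check that the input starts with an accepted protocol, and prepend one if it is missing.
    2. Instantiate the URL. The constructor throws if the input cannot be parsed.
    3. Check that the URL does not contain unwanted extra information, such as a username or password.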

    Example code:

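    const { URL } = require('url');
    // Same sample URL as in the diagram below; it carries credentials, so the check further down rejects it.
    let myTestUrl = 'https://user:pass@sub.host.com:8080/p/a/t/h?query=string#hash';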
    
    try {
      if (!myTestUrl.startsWith('https://') && !myTestUrl.startsWith('http://')) {
        // The following line is based on the assumption that the URL will resolve using https.
        // Ideally, after all checks pass, the URL should be pinged to verify the correct protocol.
        // Better yet, it should need to be provided by the user - there are nice UX techniques to address this.
        myTestUrl = `https://${myTestUrl}`
      }
    
      const normalizedUrl = new URL(myTestUrl);
    
      if (normalizedUrl.username !== '' || normalizedUrl.password !== '') {
        throw new Error('Username and password not allowed.')
      }
    
      // Do your thing
    } catch (e) {
      console.error('Invalid url provided', e)
    }
    

    This example only handles http and https, just to give the gist.
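The three steps can also be wrapped in a small reusable function. This is only a sketch under a few assumptions: `normalizeUrl` is a hypothetical helper name, input without a scheme is assumed to be https, and the dotted-hostname check is an extra heuristic of mine (not part of the URL API) to turn away bare words such as `hello`.

```javascript
const { URL } = require('url');

// Hypothetical helper: returns the normalized href, or null if the input is rejected.
function normalizeUrl(input) {
  let candidate = String(input).trim();
  if (candidate === '') return null;

  // Step 1. Assumption: input without an explicit http(s) scheme is meant to be https.
  if (!/^https?:\/\//i.test(candidate)) {
    candidate = `https://${candidate}`;
  }

  // Step 2. The constructor throws a TypeError on input it cannot parse.
  let url;
  try {
    url = new URL(candidate);
  } catch (err) {
    return null;
  }

  // Step 3. No credentials allowed. This also rejects e-mail-like input,
  // because "name@example.com" parses with username "name".
  if (url.username !== '' || url.password !== '') return null;

  // Extra heuristic (assumption): require a dot in the hostname.
  if (!url.hostname.includes('.')) return null;

  return url.href;
}

console.log(normalizeUrl('example.com'));          // https://example.com/
console.log(normalizeUrl('example.com/somepage')); // https://example.com/somepage
console.log(normalizeUrl('name@example.com'));     // null
```

Returning `null` instead of throwing keeps the call site simple for form validation. Whether the host actually answers over https still has to be checked separately, as the comments in the example above point out.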

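    Straight from the Node.js docs, a nice visualisation of the API (properties of the legacy url.parse() result above the example URL, properties of a WHATWG URL object below it):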

    ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
    │                                            href                                             │
    ├──────────┬──┬─────────────────────┬─────────────────────┬───────────────────────────┬───────┤
    │ protocol │  │        auth         │        host         │           path            │ hash  │
    │          │  │                     ├──────────────┬──────┼──────────┬────────────────┤       │
    │          │  │                     │   hostname   │ port │ pathname │     search     │       │
    │          │  │                     │              │      │          ├─┬──────────────┤       │
    │          │  │                     │              │      │          │ │    query     │       │
    "  https:   //    user   :   pass   @ sub.host.com : 8080   /p/a/t/h  ?  query=string   #hash "
    │          │  │          │          │   hostname   │ port │          │                │       │
    │          │  │          │          ├──────────────┴──────┤          │                │       │
    │ protocol │  │ username │ password │        host         │          │                │       │
    ├──────────┴──┼──────────┴──────────┼─────────────────────┤          │                │       │
    │   origin    │                     │       origin        │ pathname │     search     │ hash  │
    ├─────────────┴─────────────────────┴─────────────────────┴──────────┴────────────────┴───────┤
    │                                            href                                             │
    └─────────────────────────────────────────────────────────────────────────────────────────────┘
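To connect the diagram to code: the rows below the example URL are the properties of a WHATWG `URL` object, which is what `new URL()` returns. A minimal sketch that reads them for the same URL (the `parts` object is just an illustrative name):

```javascript
const { URL } = require('url');

const url = new URL('https://user:pass@sub.host.com:8080/p/a/t/h?query=string#hash');

// Collect the WHATWG URL properties shown in the lower half of the diagram.
const parts = {
  protocol: url.protocol, // 'https:'
  username: url.username, // 'user'
  password: url.password, // 'pass'
  host: url.host,         // 'sub.host.com:8080'
  hostname: url.hostname, // 'sub.host.com'
  port: url.port,         // '8080'
  pathname: url.pathname, // '/p/a/t/h'
  search: url.search,     // '?query=string'
  hash: url.hash,         // '#hash'
  origin: url.origin,     // 'https://sub.host.com:8080'
};

console.log(parts);
```

Note that `origin` leaves out the credentials, which is why the username and password have to be checked explicitly.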