javascript, url, dom, firefox-addon, firefox-addon-webextensions

Extracting a base domain/eTLD+1 from a URL


I'm currently writing a WebExtension. In this extension, I need to deal with a bunch of URLs in JS and extract the base domain (aka eTLD+1).

For example:

  • www.cnn.com => cnn.com
  • cnn.com => cnn.com
  • www.world.cnn.com => cnn.com
  • www.bbc.co.uk => bbc.co.uk
  • ...

As you can see from the examples, no simple rule (such as keeping the last two labels) handles every case. In fact, the official list (the Public Suffix List) is ~12,000 lines long.
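
To make the problem concrete, here is a minimal sketch of why "keep the last two labels" breaks down, next to a lookup against a tiny hand-picked suffix set. The names `naiveBaseDomain`, `sampleBaseDomain`, and `SAMPLE_SUFFIXES` are hypothetical and only for illustration; the real list also has wildcard and exception rules, which this sketch ignores.

```javascript
// Naive approach: keep only the last two labels of the hostname.
function naiveBaseDomain(hostname) {
  return hostname.split(".").slice(-2).join(".");
}

// Assumption: a tiny hand-picked subset of public suffixes, NOT the real list.
const SAMPLE_SUFFIXES = new Set(["com", "uk", "co.uk"]);

// Walk from the longest candidate suffix to the shortest; on the first
// match, keep the matched suffix plus one more label to its left.
function sampleBaseDomain(hostname) {
  const labels = hostname.split(".");
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (SAMPLE_SUFFIXES.has(candidate)) {
      // i === 0 means the hostname itself is a public suffix.
      return i === 0 ? null : labels.slice(i - 1).join(".");
    }
  }
  return null; // no known suffix in the sample set
}

console.log(naiveBaseDomain("www.cnn.com"));    // cnn.com (correct)
console.log(naiveBaseDomain("www.bbc.co.uk"));  // co.uk (wrong)
console.log(sampleBaseDomain("www.bbc.co.uk")); // bbc.co.uk
```

The second function only works because "co.uk" was put in the set by hand, which is exactly the knowledge the full list provides.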

I know that browsers can do it internally. I wonder if there is a standard way to do this in JS?


Solution

  • Maybe too late, but:

    For browser usage, there is the publicsuffixlist.js implementation by Raymond Hill (the uBlock Origin author), which works well; you can optionally enable WASM for better performance. You also need punycode.js.

    Simple usage (once you have publicsuffix.min.js and punycode.js):

    // At this point you have the Public Suffix List in a string.
    const publicSuffixList = "must contain list from https://publicsuffix.org/list/public_suffix_list.dat";
    window.publicSuffixList.parse(publicSuffixList, punycode.toASCII);
    
    // Optional: enable WASM. The WASM file must be served with the MIME type
    // "Content-Type: application/wasm".
    window.publicSuffixList.enableWASM().then(status => {
       console.log("WASM status: ", status);
    });
    
    const host = "www.bbc.co.uk";
    const hostPuny = punycode.toASCII(host);
    const domain = window.publicSuffixList.getDomain(hostPuny);
    console.log("eTLD+1 : ", punycode.toUnicode(domain));
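
Since the question starts from full URLs rather than bare hostnames, a small helper can pull the hostname out first. This is a minimal sketch using the standard `URL` constructor; the name `hostnameFromUrl` is hypothetical and not part of publicsuffixlist.js.

```javascript
// Hypothetical helper: return the hostname of a URL string,
// or null if the string cannot be parsed as an absolute URL.
function hostnameFromUrl(url) {
  try {
    return new URL(url).hostname;
  } catch (e) {
    return null; // new URL() throws a TypeError on invalid input
  }
}

console.log(hostnameFromUrl("https://www.bbc.co.uk/news?page=2")); // www.bbc.co.uk
console.log(hostnameFromUrl("not a url"));                         // null
```

The result can then be passed to `getDomain` as in the snippet above. For http(s) URLs, `hostname` should already come back in its ASCII (Punycode) form, so the extra `punycode.toASCII` call is likely harmless but redundant in that case.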