javascript regex regex-group

Regex to Extract Environment, Domain, and Hostname from URL with Variable Subdomains


I am working on a project where I need to extract the environment, hostname, and domain from URLs. The URLs have a variable number of subdomains, and I'm having difficulty constructing a regex pattern that captures the required groups.

Link: https://regex101.com/r/4DhLns/3

I need help crafting a regex pattern that can efficiently capture the following groups:

  • Group 1: environment (e.g., stage, qa)
  • Group 2: hostname (e.g., hostname)
  • Group 3: domain (e.g., com)

const regex = /.*(?<environment>(qa|stage*)).*\.(?<hostname>\w+)*\.(?<domain>\w+)$/;

function extractInfoFromURL(url) {
    const match = url.match(regex);
    
    if (match) {
        return match.groups;
    } else {
        return null; // URL didn't match the pattern
    }
}

const testUrls = [
    "https://example.test.qa.sub.hostname.com",
    "https://example.test.stage.coonect.hostname.com",
    "https://example.qa.hostname.com",
    "https://example.hostname.com",
    "https://example.stage.hostname.com",
    "https://ops-cert-stage-beta.apps.sub-test.minor.qa.test.sub.hostname.com",
    "https://ops-cert-qa-beta.apps.sub-test.minor.qa.test.sub.hostname.com",
    "https://ops-cert-qa.apps.sub-test.minor.qa.test.sub.hostname.com",
    "https://ops-cert-stage.apps.sub-test.minor.qa.test.sub.hostname.com"
];

testUrls.forEach((url, index) => {
    const result = extractInfoFromURL(url);
    
    if (result) {
        console.log(`Result for URL ${index + 1}:`, result);
    } else {
        console.log(`URL ${url} did not match the pattern.`);
    }
});

The problem is with https://example.hostname.com: the environment should be null there, while the hostname and domain should still be captured.

Regex101: https://regex101.com/r/aCCWRv/2


Solution

  • The first part .*? can be omitted from the pattern. If the match cannot contain spaces, then .*? could be \S*?, which matches as few non-whitespace characters as possible.

    A named group is already a capture group, so you don't have to put a separate capture group inside it.

    If the environment is optional, then you can wrap everything up to the point where the "hostname" starts in an optional non-capturing group.

    The leading \b is a word boundary to prevent a partial word match.

    You are currently using \w, which may be too narrow for the characters you need to allow (it does not match a hyphen, for example). You could widen it with a character class [...] that lists every allowed character.

    \b(?:(?<environment>qa|stage|dev|preprod)\S*?\.)?(?<hostname>\w+)\.(?<domain>\w+)$
    

    const regex = /\b(?:(?<environment>qa|stage|dev|preprod)\S*?\.)?(?<hostname>\w+)\.(?<domain>\w+)$/;
    
    function extractInfoFromURL(url) {
      const match = url.match(regex);
    
      if (match) {
        return match.groups;
      } else {
        return null; // URL didn't match the pattern
      }
    }
    
    const testUrls = [
      "https://example.test.qa.sub.hostname.com",
      "https://example.test.stage.coonect.hostname.com",
      "https://example.qa.hostname.com",
      "https://example.hostname.com",
      "https://example.stage.hostname.com",
      "https://ops-cert-stage-beta.apps.sub-test.minor.qa.test.sub.hostname.com",
      "https://ops-cert-qa-beta.apps.sub-test.minor.qa.test.sub.hostname.com",
      "https://ops-cert-qa.apps.sub-test.minor.qa.test.sub.hostname.com",
      "https://ops-cert-stage.apps.sub-test.minor.qa.test.sub.hostname.com"
    ];
    
    testUrls.forEach((url, index) => {
      const result = extractInfoFromURL(url);
    
      if (result) {
        console.log(`Result for URL ${index + 1}:`, result);
      } else {
        console.log(`URL ${url} did not match the pattern.`);
      }
    });
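
Two follow-ups to the points above, as a minimal sketch rather than a definitive version. First, the question asks for the environment to be null when it is absent, but in JavaScript a named group that did not take part in the match is undefined. Second, \w does not match a hyphen, so a character class such as [\w-] might be a better fit if hostnames can contain one. The names hyphenAwareRegex and extractInfoOrNull are hypothetical, and the choice of [\w-] is an assumption about which characters should be allowed.

```javascript
// Hypothetical variant of the pattern above: [\w-] instead of \w, so that
// hyphenated labels such as "my-host" are accepted. Which characters to allow
// is an assumption here; the question does not specify it.
const hyphenAwareRegex =
  /\b(?:(?<environment>qa|stage|dev|preprod)\S*?\.)?(?<hostname>[\w-]+)\.(?<domain>[\w-]+)$/;

// Hypothetical helper: returns null when the URL does not match at all, and
// reports a missing environment as null instead of undefined.
function extractInfoOrNull(url) {
  const match = url.match(hyphenAwareRegex);
  if (!match) {
    return null;
  }
  // A named group that did not participate in the match is undefined,
  // so the destructuring default turns it into null.
  const { environment = null, hostname, domain } = match.groups;
  return { environment, hostname, domain };
}

console.log(extractInfoOrNull("https://example.hostname.com"));
// { environment: null, hostname: 'hostname', domain: 'com' }
console.log(extractInfoOrNull("https://example.qa.my-host.com"));
// { environment: 'qa', hostname: 'my-host', domain: 'com' }
```

Because the pattern is anchored with $, a URL that carries a path or query string after the domain would not match. This sketch assumes the input ends at the domain, as all the test URLs above do.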