
What is the regular expression to extract the second path segment of a URI?


I need to extract only the second path segment of a URI i.e. given the following URI:

/first/second/third/fourth/...

the regex should extract the second string from the URI. An explanation of the solution regex would be greatly appreciated.

I am using a POSIX-compliant regex library.

EDIT: The solution given by Gumbo works at REtester

But, it doesn't seem to work with the code below:

#include <stdio.h>
#include <stdlib.h>
#include "regex.h"
char *regexp (const char *string, const char *patrn, int *begin, int *end){     
        int i, w=0, len;                  
        char *word = NULL;
        regex_t rgT;
        regmatch_t match;
        wsregcomp(&rgT,patrn,REG_EXTENDED);
        if ((wsregexec(&rgT,string,1,&match,0)) == 0) {
                *begin = (int)match.rm_so;
                *end = (int)match.rm_eo;
                len = *end-*begin;
                word = (char*) malloc(len+1);
                for (i=*begin; i<*end; i++) {
                        word[w] = string[i];
                        w++; }
                word[w]=0;
        }
        wsregfree(&rgT);
        return word;
}

int main(){
    int begin = 0;
    int end = 0;

    char *word = regexp("/first/second/third","^/[^/]+/([^/]*)",&begin,&end);
    printf("ENV %s\n",word);
}

The above prints /first/second instead of just second.

EDIT2: Same result with java.util.regex as well.


Solution

  • If you only have an absolute URI path, then this regular expression should do it:

    ^/[^/]+/([^/]*)
    

    An explanation:

    • ^/ matches the start of the string followed by a literal /
    • [^/]+/ matches one or more characters except /, followed by a literal /
    • ([^/]*) matches zero or more characters except /.

    The second path segment is then matched by the first group. I used + for the first segment and * for the second because if the first segment were also allowed to be empty, the string would begin with // and would no longer be an absolute path but a scheme-less URI.
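
The code in the question reads only one regmatch_t entry, and in the POSIX API entry 0 always describes the whole match (/first/second), while entry 1 describes the first parenthesized group. The following is a minimal sketch of reading the group instead. It assumes the standard POSIX names regcomp, regexec and regfree (the wsreg* wrappers in the question are presumed to behave the same way), and second_segment is a hypothetical helper name, not part of any library.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the question): returns a malloc'd copy of
 * the second path segment of `path`, or NULL if the pattern does not match
 * or an allocation/compilation error occurs. The caller frees the result. */
char *second_segment(const char *path)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = first (...) group */
    char *out = NULL;

    assert(path != NULL);

    if (regcomp(&re, "^/[^/]+/([^/]*)", REG_EXTENDED) != 0)
        return NULL;

    /* Ask for 2 entries so that the group offsets are filled in. */
    if (regexec(&re, path, 2, m, 0) == 0 && m[1].rm_so != -1) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        out = malloc(len + 1);
        if (out != NULL) {
            memcpy(out, path + m[1].rm_so, len);
            out[len] = '\0';
        }
    }

    regfree(&re);
    return out;
}
```

Under these assumptions, second_segment("/first/second/third") should yield second. The same reasoning likely explains EDIT2: in java.util.regex, group() or group(0) is the whole match, and group(1) is the first capturing group.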