C# Extract part of the string that starts with specific letters


I have a string which I extract from an HTML document like this:

    var elas = htmlDoc.DocumentNode.SelectSingleNode("//a[@class='a-size-small a-link-normal a-text-normal']");
    if (elas != null)
    {
        _extractedString = elas.Attributes["href"].Value;
    }

The HREF attribute contains this part of the string:

gp/offer-listing/B002755TC0/

I'm trying to extract the B002755TC0 value, but the string varies in length, so I can't simply call C#'s Substring method with fixed offsets.

Instead, I was wondering whether there's a clever way to do this, perhaps by matching the part of the string that comes right before the value I'm looking for?

For example, I know for a fact that each href has the structure shown above, so I would simply match this keyword:

offer-listing/

So I would find this keyword and extract the part of the string that follows it (B002755TC0), up to the next "/" character?

Can someone help me out with this?


Solution

  • This is a perfect job for a regular expression:

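    using System.Text.RegularExpressions;

    string text = "gp/offer-listing/B002755TC0/";
    
    Regex pattern = new Regex(@"offer-listing/(\w+)/");
    
    Match match = pattern.Match(text);
    // Groups[1] is the first capture group; its Value is "" if nothing matched.
    string whatYouAreLookingFor = match.Groups[1].Value;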
    

    Explanation: we just match the exact pattern you need.

    • 'offer-listing/'
    • followed by one or more 'word characters' (letters, digits, and underscore; note that \w does not match a hyphen),
    • followed by a slash.

    The parentheses () mean 'capture this group' (so we can extract it later with match.Groups[1]).
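
As a minimal sketch of the pattern in use, the helper below wraps the match in a method and checks `Match.Success` before reading the group, so the caller can tell "no match" apart from a real value. The names `HrefParser` and `TryExtractListingId` are hypothetical, introduced here only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class HrefParser
{
    // 'offer-listing/', then one or more word characters (captured), then '/'.
    private static readonly Regex ListingPattern =
        new Regex(@"offer-listing/(\w+)/");

    // Returns true and sets 'id' to the captured segment when the href matches;
    // otherwise returns false and sets 'id' to the empty string.
    public static bool TryExtractListingId(string href, out string id)
    {
        Match match = ListingPattern.Match(href);
        id = match.Success ? match.Groups[1].Value : string.Empty;
        return match.Success;
    }
}
```

Calling `HrefParser.TryExtractListingId("gp/offer-listing/B002755TC0/", out var id)` should return true with `id` set to `B002755TC0`; an href without `offer-listing/` returns false.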


    EDIT: if you also want to extract the value from a string like this: /dp/B01KRHBT9Q/

    Then you could use this pattern:

    Regex pattern = new Regex(@"/(\w+)/$");
    

    which will match both this string and the previous one. The $ stands for the end of the string, so this literally means:

    capture the word characters between the last two slashes of the string (the string must end with a slash for this to match)
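
To illustrate how the end-anchored pattern behaves on both inputs, here is a small sketch. The class and method names (`TrailingSegment`, `TryExtract`) are hypothetical, and the only inputs assumed are the two hrefs shown in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class TrailingSegment
{
    // '/', one or more word characters (captured), '/', then end of string.
    private static readonly Regex LastSegmentPattern =
        new Regex(@"/(\w+)/$");

    // Returns true and sets 'segment' when the href ends with "/<word chars>/";
    // otherwise returns false and sets 'segment' to the empty string.
    public static bool TryExtract(string href, out string segment)
    {
        Match match = LastSegmentPattern.Match(href);
        segment = match.Success ? match.Groups[1].Value : string.Empty;
        return match.Success;
    }
}
```

With this sketch, both `gp/offer-listing/B002755TC0/` and `/dp/B01KRHBT9Q/` yield their final segment. An href without the trailing slash does not match, so if that can occur, the pattern would need to make the last slash optional.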