Regex on html image string


I am trying to get the image ID (which is at the end of the src link, right before the file type) from this HTML, but for some reason the regex I've written isn't working. Accessing the document object is not an option in this case, which is why I need a regex for it. Any help would be appreciated.

This is what I have so far, but it's failing at the sizes check:

const imgRegX = /<div class="?preview item"?[^>]*>\s*<img alt="?" sizes= "?"/g;

Here is how the string looks:

<div class="preview item"><img alt=""
sizes="(max-width: 440px) 320px"
src= "https://m.testlink.com/test/zx320y230c_4130512.jpg"
srcset= "https://m.testlink.com/test/zx320y230c_4130512.jpg 320w, https://m.testlink.com/test/zx640y460c_4130512.jpg 640w"></div>

Solution

  • The following should do what you need. I've simplified it a bit by leaving out the sizes and alt attributes, since you apparently do not need them:

    /<div\s+class="preview item"[^>]*>\s*<img\s+[\s\S]*?src=\s?".*?([^\/]+?)"/gi
    

    There is at least one major misunderstanding here, and that is your usage of the question mark. The question mark (?) is a quantifier, and on its own it means "match 0 or 1 of the preceding token". That only holds when the preceding token is not itself a quantifier. When it follows another quantifier (such as * or +), it becomes a "lazy" modifier instead: rather than being greedy (matching as many times as possible), that quantifier matches as few times as possible.
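    The two roles of the question mark can be sketched as follows. This is only an illustration: the variable names and the short test strings are made up for the example and are not part of your HTML.

```javascript
// Role 1: after a single token, "?" means "zero or one of that token".
const optionalQuote = /^alt="?"$/;
const twoQuotes = optionalQuote.test('alt=""');    // true: the optional quote is present
const oneQuote = optionalQuote.test('alt="');      // true: the optional quote is skipped
const threeQuotes = optionalQuote.test('alt="""'); // false: "?" allows at most one

// Role 2: after another quantifier, "?" makes that quantifier lazy.
const greedyRun = 'aaa'.match(/a+/)[0]; // 'aaa': as many as possible
const lazyRun = 'aaa'.match(/a+?/)[0];  // 'a': as few as possible

console.log(twoQuotes, oneQuote, threeQuotes, greedyRun, lazyRun);
```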

    In order to match your string and capture the last segment of the src URL (the part that contains the ID), we first replace the literal space after <div with \s, the whitespace character class (any whitespace character), matching 1 or more times (+ means 1 or more). The rest of the regex up until the image tag remains mostly unchanged.

    After the image tag's start, we match 1 or more whitespace characters, then 0 or more characters of any kind, matching as few times as possible. That "any kind" is written [\s\S]: \S is the non-whitespace shorthand, and putting \s and \S together in a character class ([]) covers every character, including line breaks.
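    To see why [\s\S] is used there rather than a plain dot, here is a small sketch. The two-line tag is a made-up stand-in for your HTML, where the attributes sit on separate lines.

```javascript
// Hypothetical two-line tag: the src attribute sits on its own line.
const tag = '<img alt=""\nsrc="a.jpg">';

// "." does not match a line break, so the lazy run cannot reach "src=".
const dotReachesSrc = /<img\s+.*?src=/.test(tag);

// [\s\S] matches every character, line breaks included.
const classReachesSrc = /<img\s+[\s\S]*?src=/.test(tag);

console.log(dotReachesSrc);   // false
console.log(classReachesSrc); // true
```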

    Finally, we get to the src attribute. Here, we allow an optional whitespace character (\s?) after the equals sign, followed by a standard double quote (which you may need to change to ["'] if the quote style ever varies). Next comes 0 or more of any character (the dot matches any character except a line break), matching as few times as possible. That is followed by a capturing group (()), which matches 1 or more non-forward-slash characters (the slash is escaped so it doesn't end the regex literal), again as few times as possible, before finally reaching the closing quote mark. Because the group cannot contain a slash, it ends up holding everything between the last slash of the URL and the closing quote.
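    As written, the capturing group holds the whole file name rather than just the number. If you only want the digits, one possible variant is sketched below. It assumes the ID is always the run of digits between the last underscore and the file extension, which is only a guess based on your sample; idRe and the other names are made up for this example.

```javascript
const str = `<div class="preview item"><img alt=""
sizes="(max-width: 440px) 320px"
src= "https://m.testlink.com/test/zx320y230c_4130512.jpg"
srcset= "https://m.testlink.com/test/zx320y230c_4130512.jpg 320w, https://m.testlink.com/test/zx640y460c_4130512.jpg 640w"></div>`;

// Pattern from the answer: group 1 is everything after the last "/" in src.
const fileNameRe = /<div\s+class="preview item"[^>]*>\s*<img\s+[\s\S]*?src=\s?".*?([^\/]+?)"/i;
const fileName = fileNameRe.exec(str)[1];

// Hypothetical variant: accept either quote style and capture only the digits
// between the last "_" and the file extension (an assumption about the ID format).
const idRe = /<div\s+class="preview item"[^>]*>\s*<img\s+[\s\S]*?src=\s?["'][^"']*_(\d+)\.\w+["']/i;
const imageId = idRe.exec(str)[1];

console.log(fileName); // zx320y230c_4130512.jpg
console.log(imageId);  // 4130512
```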

    I use the lazy modifier multiple times because, in my experience, a greedy quantifier tends to run past the first occurrence of the character that follows it and stop at the last one instead, so the match ends up covering more than intended.
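    A small sketch of that overshoot, using a made-up single-line tag with both a src and a srcset attribute:

```javascript
// Hypothetical tag, kept on one line so "." can reach the end of it.
const tag = '<img src="a.jpg" srcset="a.jpg 320w">';

// Greedy: ".*" runs to the last double quote in the tag.
const greedy = tag.match(/src="(.*)"/)[1];

// Lazy: ".*?" stops at the first closing quote.
const lazy = tag.match(/src="(.*?)"/)[1];

console.log(greedy); // a.jpg" srcset="a.jpg 320w
console.log(lazy);   // a.jpg
```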

    I added in the i flag in order to make the search case-insensitive, though you may need to change that depending on how case-sensitive you want your pattern to be.

    Here is a demo of the regex in action:

    let reg = /<div\s+class="preview item"[^>]*>\s*<img\s+[\s\S]*?src=\s?".*?([^\/]+?)"/gi;
    let str = `<div class="preview item"><img alt=""
    sizes="(max-width: 440px) 320px"
    src= "https://m.testlink.com/test/zx320y230c_4130512.jpg"
    srcset= "https://m.testlink.com/test/zx320y230c_4130512.jpg 320w, https://m.testlink.com/test/zx640y460c_4130512.jpg 640w"></div>`
    
    console.log(reg.exec(str)[1]);

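    Of note, in regard to the above snippet: the capturing group sits at index 1 of the array returned by .exec(), while index 0 holds the entire match. For the sample string, the group contains the file name, zx320y230c_4130512.jpg.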
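    Because the pattern carries the g flag, .exec() remembers where the previous match ended (in lastIndex) and continues from there on the next call. If your string contains several preview items, a loop along these lines should collect all of them. The second URL below is invented for the example, and html and fileNames are just illustrative names.

```javascript
const pattern = /<div\s+class="preview item"[^>]*>\s*<img\s+[\s\S]*?src=\s?".*?([^\/]+?)"/gi;

// Two preview items; the second one is made up for illustration.
const html = `<div class="preview item"><img alt="" src="https://m.testlink.com/test/zx320y230c_4130512.jpg"></div>
<div class="preview item"><img alt="" src="https://m.testlink.com/test/zx320y230c_4130513.jpg"></div>`;

const fileNames = [];
let match;
// Each call resumes at pattern.lastIndex; exec() returns null when no match is left.
while ((match = pattern.exec(html)) !== null) {
  fileNames.push(match[1]); // index 0 is the whole match, index 1 is the group
}

console.log(fileNames); // [ 'zx320y230c_4130512.jpg', 'zx320y230c_4130513.jpg' ]
```

    If you only ever need the first match, dropping the g flag avoids this statefulness altogether.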

    Lastly, here is a demo from Regex101, my go-to regex debugging site.

    For all other learning purposes I highly recommend regular-expressions.info; it's how I learned regex myself.