
How to get image url from html string using regex


I am trying to get an image URL from an HTML string using regex. This is what I have so far:

import Foundation

extension String {
    /// Returns every substring of the receiver that matches `pattern`.
    func regex(pattern: String) -> [String] {
        do {
            let regex = try NSRegularExpression(pattern: pattern, options: [])
            let nsstr = self as NSString
            let all = NSRange(location: 0, length: nsstr.length)
            var matches = [String]()
            regex.enumerateMatches(in: self, options: [], range: all) { result, _, _ in
                if let r = result {
                    // r.range covers the whole match (capture group 0).
                    matches.append(nsstr.substring(with: r.range))
                }
            }
            return matches
        } catch {
            // An invalid pattern produces an empty result.
            return []
        }
    }
}

The pattern is: <img[^>]+src\\s*=\\s*['\']([^'\"]+)['\"][^>]*>

I still can't get the image URL with it: it returns an empty array, even though my HTML string includes one image. I don't want to use UIWebView because of a UITableView resizing problem, so I need to extract the image URL from the HTML and show it in a UIImageView using AlamofireImage.

Any help? I only need to fetch one URL.

Here is my tag:

<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>

And this is what I want to get from it:

https://en.wikipedia.org/wiki/File:BH_LMC.png

Solution

  • Description

    <img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?> 
    


    This regular expression will do the following:

    • Matches the entire IMG tag (capture group 0)
    • Places the src attribute value into capture group 1, without the surrounding quotes
    • Allows the other attribute values to be single-quoted, double-quoted, or unquoted
    • Can be modified to validate any number of other attributes
    • Handles edge cases that tend to make parsing HTML difficult, such as a quoted attribute value that contains ">" or a decoy "src="
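In Swift, the pattern above might be applied roughly as follows. This is a minimal sketch, assuming Swift 5 or later (for the raw string literal); `imgPattern` and `imageSources(in:)` are hypothetical names introduced for illustration and are not part of the original answer. Unlike the extension in the question, it returns capture group 1 of each match rather than the whole match.

```swift
import Foundation

// The pattern from this answer, written as a raw string literal so that
// backslashes and quotes need no additional escaping.
let imgPattern = #"<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>"#

/// Hypothetical helper: returns the src value (capture group 1)
/// of every IMG tag that the pattern matches in `html`.
func imageSources(in html: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: imgPattern, options: []) else {
        return []
    }
    let whole = NSRange(html.startIndex..<html.endIndex, in: html)
    return regex.matches(in: html, options: [], range: whole).compactMap { match -> String? in
        // Convert the NSRange of group 1 back into a Swift string range.
        guard let range = Range(match.range(at: 1), in: html) else { return nil }
        return String(html[range])
    }
}

let tag = #"<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>"#
print(imageSources(in: tag))
```

Since only one URL is needed, taking `imageSources(in: html).first` and building a `URL` from it would be one way to hand the result to an image loader.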

    Example

    Live Demo

    https://regex101.com/r/qW9nG8/1

    Sample text

    Note the difficult edge case in the first line, where the onmouseover value contains both ">" characters and a decoy src attribute, and we are looking for a specific droid.

    <img onmouseover=' if ( 6 > 3 { funSwap(" src="NotTheDroidYourLookingFor.jpg", 6 > 3 ) } ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
    some text
    
    <img src="http://website/someurl.jpeg" onload="img_onload(this);" />
    more text
    <img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
    

    Sample Matches

    • Capture group 0 gets the entire IMG tag
    • Capture group 1 gets just the src attribute value
    [0][0] = <img onmouseover=' if ( 6 > 3 { funSwap(" src="NotTheDroidYourLookingFor.jpg", 6 > 3 ) } ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
    [0][1] = http://website/ThisIsTheDroidYourLookingFor.jpeg
    
    [1][0] = <img src="http://website/someurl.jpeg" onload="img_onload(this);" />
    [1][1] = http://website/someurl.jpeg
    
    [2][0] = <img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
    [2][1] = https://en.wikipedia.org/wiki/File:BH_LMC.png
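A listing in the `[match][group]` layout above could be produced in Swift along these lines. This is only a sketch: `captureGroups(of:in:)`, `pattern`, and `sample` are hypothetical names, and the sample is limited to the last two tags of the sample text for brevity.

```swift
import Foundation

// The pattern from this answer as a raw string literal (Swift 5+).
let pattern = #"<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>"#

/// Hypothetical helper: for each match, returns [group 0, group 1, ...].
/// A group that did not participate in the match is reported as "".
func captureGroups(of pattern: String, in text: String) -> [[String]] {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return []
    }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: whole).map { match -> [String] in
        (0..<match.numberOfRanges).map { index -> String in
            guard let range = Range(match.range(at: index), in: text) else { return "" }
            return String(text[range])
        }
    }
}

let sample = #"""
<img src="http://website/someurl.jpeg" onload="img_onload(this);" />
more text
<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
"""#

// Print every group of every match, one per line.
for (i, groups) in captureGroups(of: pattern, in: sample).enumerated() {
    for (j, value) in groups.enumerated() {
        print("[\(i)][\(j)] = \(value)")
    }
}
```

Group 0 is always the full tag, so code that only needs the URL can read index 1 of each inner array.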
    

    Explanation

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
      <img                     '<img'
    ----------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    ----------------------------------------------------------------------
      (?=                      look ahead to see if there is:
    ----------------------------------------------------------------------
        \s                       whitespace (\n, \r, \t, \f, and " ")
    ----------------------------------------------------------------------
      )                        end of look-ahead
    ----------------------------------------------------------------------
      (?=                      look ahead to see if there is:
    ----------------------------------------------------------------------
        (?:                      group, but do not capture (0 or more
                                 times (matching the least amount
                                 possible)):
    ----------------------------------------------------------------------
          [^>=]                    any character except: '>', '='
    ----------------------------------------------------------------------
         |                        OR
    ----------------------------------------------------------------------
          ='                       '=\''
    ----------------------------------------------------------------------
          [^']*                    any character except: ''' (0 or more
                                   times (matching the most amount
                                   possible))
    ----------------------------------------------------------------------
          '                        '\''
    ----------------------------------------------------------------------
         |                        OR
    ----------------------------------------------------------------------
          ="                       '="'
    ----------------------------------------------------------------------
          [^"]*                    any character except: '"' (0 or more
                                   times (matching the most amount
                                   possible))
    ----------------------------------------------------------------------
          "                        '"'
    ----------------------------------------------------------------------
         |                        OR
    ----------------------------------------------------------------------
          =                        '='
    ----------------------------------------------------------------------
          [^'"]                    any character except: ''', '"'
    ----------------------------------------------------------------------
          [^\s>]*                  any character except: whitespace (\n,
                                   \r, \t, \f, and " "), '>' (0 or more
                                   times (matching the most amount
                                   possible))
    ----------------------------------------------------------------------
        )*?                      end of grouping
    ----------------------------------------------------------------------
        \s                       whitespace (\n, \r, \t, \f, and " ")
    ----------------------------------------------------------------------
        src=                     'src='
    ----------------------------------------------------------------------
        ['"]                     any character of: ''', '"'
    ----------------------------------------------------------------------
        (                        group and capture to \1:
    ----------------------------------------------------------------------
          [^"]*                    any character except: '"' (0 or more
                                   times (matching the most amount
                                   possible))
    ----------------------------------------------------------------------
        )                        end of \1
    ----------------------------------------------------------------------
        ['"]?                    any character of: ''', '"' (optional
                                 (matching the most amount possible))
    ----------------------------------------------------------------------
      )                        end of look-ahead
    ----------------------------------------------------------------------
      (?:                      group, but do not capture (0 or more times
                               (matching the most amount possible)):
    ----------------------------------------------------------------------
        [^>=]                    any character except: '>', '='
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        ='                       '=\''
    ----------------------------------------------------------------------
        [^']*                    any character except: ''' (0 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
        '                        '\''
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        ="                       '="'
    ----------------------------------------------------------------------
        [^"]*                    any character except: '"' (0 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
        "                        '"'
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        =                        '='
    ----------------------------------------------------------------------
        [^'"\s]*                 any character except: ''', '"',
                                 whitespace (\n, \r, \t, \f, and " ") (0
                                 or more times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
      )*                       end of grouping
    ----------------------------------------------------------------------
      "                        '"'
    ----------------------------------------------------------------------
      \s?                      whitespace (\n, \r, \t, \f, and " ")
                               (optional (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      \/?                      '/' (optional (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      >                        '>'
    ----------------------------------------------------------------------
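The breakdown above can also be kept next to the pattern in source code. The sketch below assumes `NSRegularExpression` is compiled with the `.allowCommentsAndWhitespace` option, which ignores unescaped whitespace and `#` comments in the pattern; `annotatedPattern` and `firstImageSource(in:)` are hypothetical names, and the pattern is intended to be the same expression, only spread over several lines.

```swift
import Foundation

// Same expression as in the answer, laid out with comments.
// Requires the .allowCommentsAndWhitespace option when compiling.
let annotatedPattern = #"""
<img\b(?=\s)                # the tag name, followed by whitespace
(?=                         # look ahead for the src attribute
  (?: [^>=]                 #   any character other than > and =
    | ='[^']*'              #   a single-quoted attribute value
    | ="[^"]*"              #   a double-quoted attribute value
    | =[^'"][^\s>]*         #   an unquoted attribute value
  )*?
  \ssrc=['"]([^"]*)['"]?    #   capture group 1 receives the src value
)
(?: [^>=]                   # now consume the attributes
  | ='[^']*'
  | ="[^"]*"
  | =[^'"\s]*
)*
"\s?\/?>                    # final quote, optional space and slash, then >
"""#

/// Hypothetical helper: returns the src value of the first matching IMG tag.
func firstImageSource(in html: String) -> String? {
    guard let regex = try? NSRegularExpression(pattern: annotatedPattern,
                                               options: [.allowCommentsAndWhitespace]) else {
        return nil
    }
    let whole = NSRange(html.startIndex..<html.endIndex, in: html)
    guard let match = regex.firstMatch(in: html, options: [], range: whole),
          let range = Range(match.range(at: 1), in: html) else {
        return nil
    }
    return String(html[range])
}
```

Because the question only needs a single URL, this variant stops at the first match instead of collecting all of them.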