Tags: swift, regex, web-scraping, swift3

Swift scraping a webpage using regex or alternative


See updates below first.

I am trying to scrape all the moderators of a given subreddit on Reddit. The API only returns the moderators' usernames for a subreddit, so initially I fetched those and then made an additional request for each profile to get the avatar URL. This ended up exceeding the API rate limit.

So instead I want to fetch the source of the page below and paginate through it, collecting the 10 usernames and avatar URLs on each page. This will hit the website with fewer requests. I understand how to do the pagination part, but for now I am trying to work out how to gather the usernames and their corresponding avatar URLs.

Take the following URL:

https://www.reddit.com/r/videos/about/moderators/

The plan is to pull the entire page source, put each moderator's username and avatar URL into a mod object, and collect those objects in an array.

Would using regex on the string I get back be a good idea?

This is my code so far; any help would be great:

    func tester() {
        let url = URL(string: "https://www.reddit.com/r/videos/about/moderators")!

        let task = URLSession.shared.dataTask(with: url) { data, response, error in
            guard let data = data, error == nil else {
                print("\(error)")
                return
            }

            let string = String(data: data, encoding: .utf8)

            let regexUsernames = try? NSRegularExpression(pattern: "href=\"/user/[a-z0-9]\"", options: .caseInsensitive)

            var results = regexUsernames?.matches(in: string as String, options: [], range: NSRange(location: 0, length: string.length))

            let regexProfileURLs = try? NSRegularExpression(pattern: "><img src=\"[a-z0-9]\" style", options: .caseInsensitive)

            print("\(results)") // This shows as empty array
        }

        task.resume()
    }

I have also tried the following but get this error:

Can't form Range with upperBound < lowerBound

Code:

    func tester() {
        let url = URL(string: "https://www.reddit.com/r/videos/about/moderators")!

        let task = URLSession.shared.dataTask(with: url) { data, response, error in
            guard let data = data, error == nil else {
                print("data was nil")
                return
            }

            guard let htmlString = String(data: data, encoding: .utf8) else {
                print("cannot cast data into string")
                return
            }

            let leftSideOfValue = "href=\"/user/"
            let rightSideOfValue = "\""

            guard let leftRange = htmlString.range(of: leftSideOfValue) else {
                print("cannot find range left")
                return
            }

            guard let rightRange = htmlString.range(of: rightSideOfValue) else {
                print("cannot find range right")
                return
            }

            let rangeOfTheValue = leftRange.upperBound..<rightRange.lowerBound

            print(htmlString[rangeOfTheValue])
        }

        task.resume()
    }
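
The error above is consistent with range(of: "\"") finding the first double quote anywhere in the page, which most likely sits before the href="/user/ match, so the lower bound ends up after the upper bound. A minimal sketch of one way around this, assuming the closing quote should only be searched for after the opening marker; the helper name firstValue(in:from:to:) and the sample markup are hypothetical, not Reddit's real page source:

```swift
import Foundation

// Hypothetical helper: returns the text between the first `left` marker
// and the next `right` marker that follows it.
func firstValue(in text: String, from left: String, to right: String) -> String? {
    guard let leftRange = text.range(of: left) else { return nil }
    // Restrict the second search to the part of the string after `left`,
    // so the resulting bounds can never be inverted.
    guard let rightRange = text.range(of: right, range: leftRange.upperBound..<text.endIndex) else {
        return nil
    }
    return String(text[leftRange.upperBound..<rightRange.lowerBound])
}

// Made-up sample markup; note that a double quote appears before the href.
let sampleHTML = "<div class=\"mod\"><a href=\"/user/alice\">alice</a></div>"
print(firstValue(in: sampleHTML, from: "href=\"/user/", to: "\"") ?? "not found")
```

The only change from the attempt above is the range: argument on the second search.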

UPDATE:

I have gotten to the point where it gives me the first username, but my loop just returns the same one over and over. What would be the best way to advance at each step? Is there a way to do something like let newHTMLString = htmlString.dropFirst(k: ?) to replace htmlString with the substring that follows the elements we just extracted?

    func tester() {
        let url = URL(string: "https://www.reddit.com/r/pics/about/moderators")!

        let task = URLSession.shared.dataTask(with: url) { data, response, error in
            guard let data = data, error == nil else {
                print("data was nil")
                return
            }

            guard let htmlString = String(data: data, encoding: .utf8) else {
                print("cannot cast data into string")
                return
            }

            let counter = htmlString.components(separatedBy: "href=\"/user/")
            let count = counter.count

            for i in 0...count {

                let leftSideOfUsernameValue = "href=\"/user/"
                let rightSideOfUsernameValue = "\""

                let leftSideOfAvatarURLValue = "><img src=\""
                let rightSideOfAvatarURLValue = "\">"

                guard let leftRange = htmlString.range(of: leftSideOfUsernameValue) else {
                    print("cannot find range left")
                    return
                }

                guard let rightRange = htmlString.range(of: rightSideOfUsernameValue) else {
                    print("cannot find range right")
                    return
                }

                let username = htmlString.slice(from: leftSideOfUsernameValue, to: rightSideOfUsernameValue)
                print(username)
                guard let avatarURL = htmlString.slice(from: leftSideOfAvatarURLValue, to: rightSideOfAvatarURLValue) else {
                    print("Error")
                    return
                }
                print(avatarURL)

            }

        }

        task.resume()
    }

I have also tried:

    let endString = String(avatarURL + rightSideOfAvatarURLValue)
    let endIndex = htmlString.index(endString.endIndex, offsetBy: 0)
    let substringer = htmlString[endIndex...]
    htmlString = String(substringer)
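
One possible way to move forward on each iteration without rebuilding htmlString is to keep a moving search start index and pass it to range(of:range:). The following is only a sketch under that assumption; the helper name allValues(in:from:to:) and the sample markup are hypothetical:

```swift
import Foundation

// Hypothetical helper: collects every value found between `left` and `right`.
func allValues(in text: String, from left: String, to right: String) -> [String] {
    var values: [String] = []
    var searchStart = text.startIndex
    // Each pass searches only the part of the string that has not been consumed yet.
    while let leftRange = text.range(of: left, range: searchStart..<text.endIndex),
          let rightRange = text.range(of: right, range: leftRange.upperBound..<text.endIndex) {
        values.append(String(text[leftRange.upperBound..<rightRange.lowerBound]))
        // Continue after the closing marker so the same match is not found again.
        searchStart = rightRange.upperBound
    }
    return values
}

// Made-up sample markup standing in for the downloaded page.
let sampleHTML = """
<a href="/user/alice"><img src="https://example.com/a.png" style="x"></a>
<a href="/user/bob"><img src="https://example.com/b.png" style="x"></a>
"""
print(allValues(in: sampleHTML, from: "href=\"/user/", to: "\""))
print(allValues(in: sampleHTML, from: "<img src=\"", to: "\""))
```

Because the indices all belong to the original string, nothing is copied until a value is extracted, and the loop ends on its own when no further opening marker is found.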

Solution

  • You should be able to pull all the names and URLs into two separate arrays with a simple regex, something like:

    import Foundation

    func tester() {
        let url = URL(string: "https://www.reddit.com/r/pics/about/moderators")!
    
        let task = URLSession.shared.dataTask(with: url) { data, response, error in
            guard let data = data, error == nil else { return }
            guard let htmlString = String(data: data, encoding: .utf8) else { return }
    
            let names = htmlString.matching(regex: "href=\"/user/(.*?)\"")
            let imageUrls = htmlString.matching(regex: "><img src=\"(.*?)\" style")
            print(names)
            print(imageUrls)
        }
        task.resume()
    }
    
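    extension String {
        // Returns capture group 1 of every match, or the whole match if the pattern has no groups.
        func matching(regex: String) -> [String] {
            guard let regex = try? NSRegularExpression(pattern: regex, options: []) else { return [] }
            // NSRange counts UTF-16 units, so derive it from the string's indices rather than from count.
            let fullRange = NSRange(startIndex..., in: self)
            return regex.matches(in: self, options: [], range: fullRange).compactMap { match in
                let group = match.numberOfRanges > 1 ? match.range(at: 1) : match.range
                return Range(group, in: self).map { String(self[$0]) }
            }
        }
    }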
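
    One detail worth checking offline: NSRegularExpression reports the whole match in range and each parenthesised group through range(at:), so getting just the username or URL means reading group 1. A small sketch under that assumption, run against made-up markup rather than the live page; the helper name firstCaptures(of:in:) is hypothetical:

```swift
import Foundation

// Hypothetical helper: returns capture group 1 of every match of `pattern`.
// The pattern is assumed to contain at least one capture group.
func firstCaptures(of pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    // NSRange counts UTF-16 units, so build it from the String's own indices.
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match -> String? in
        // range(at: 1) is the first parenthesised group; range(at: 0) would be the whole match.
        Range(match.range(at: 1), in: text).map { String(text[$0]) }
    }
}

// Made-up sample markup; the emoji makes the character count differ from the UTF-16 length.
let sampleHTML = "<p>👋 mods</p><a href=\"/user/alice\"><img src=\"https://example.com/a.png\" style=\"x\"></a>"
print(firstCaptures(of: "href=\"/user/(.*?)\"", in: sampleHTML))
print(firstCaptures(of: "<img src=\"(.*?)\" style", in: sampleHTML))
```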
    

    Alternatively, you can create an object for each <div class="_1sIhmckJjyRyuR_z7M5kbI"> element and then read the name and URL from it as required.
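
    A rough sketch of that per-moderator idea, assuming each profile link is followed by its own avatar image before the next profile link appears. The Moderator type, the moderators(in:) function and the sample markup are hypothetical, and a single regex with two capture groups is used here in place of real HTML parsing:

```swift
import Foundation

// Hypothetical model; the field names are illustrative.
struct Moderator: Equatable {
    let username: String
    let avatarURL: String
}

func moderators(in html: String) -> [Moderator] {
    // Group 1: username after href="/user/. Group 2: the next img src value.
    // [\s\S]*? lazily skips whatever markup sits between the two.
    let pattern = "href=\"/user/([^\"]+)\"[\\s\\S]*?<img src=\"([^\"]+)\""
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(html.startIndex..., in: html)
    return regex.matches(in: html, options: [], range: fullRange).compactMap { match -> Moderator? in
        guard let nameRange = Range(match.range(at: 1), in: html),
              let urlRange = Range(match.range(at: 2), in: html) else { return nil }
        return Moderator(username: String(html[nameRange]), avatarURL: String(html[urlRange]))
    }
}

// Made-up sample markup, not Reddit's real page source.
let sampleHTML = """
<div><a href="/user/alice">alice</a><img src="https://example.com/a.png" style="x"></div>
<div><a href="/user/bob">bob</a><img src="https://example.com/b.png" style="x"></div>
"""
for mod in moderators(in: sampleHTML) {
    print(mod.username, mod.avatarURL)
}
```

    Pairing the two values in one match keeps each username with its own avatar, which two separate arrays cannot guarantee if a moderator has no image. Generated class names like the one above tend to change, so the pattern here deliberately avoids depending on them.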