Swift Load Website to Scrape Code Without Loading View | WebKit


I have an array of Google News article URLs. Google News article URLs redirect immediately to the real article URLs, e.g. CNBC.com/.... I am trying to extract the real, redirected URL. I thought I could loop through the list, load each Google News link in a WebView, and then read webView.url inside a DispatchQueue.main.asyncAfter block after 1 second to get the real URL, but this doesn't work.

How can I fetch a list of redirected URLs quickly?

Here's my code you could use to reproduce the problem:

        let webView = WKWebView()
        let myList = [URL(string: "https://news.google.com/articles/CAIiEDthIxbgofssGWTpXgeJXzwqGQgEKhAIACoHCAow2Nb3CjDivdcCMJ_d7gU?hl=en-US&gl=US&ceid=US%3Aen"), URL(string: "https://news.google.com/articles/CAIiEP5m1nAOPt-LIA4IWMOdB3MqGQgEKhAIACoHCAowocv1CjCSptoCMPrTpgU?hl=en-US&gl=US&ceid=US%3Aen")]

        for url in myList {
            guard let link = url else {continue}
            self.webView.load(URLRequest(url: link))

            DispatchQueue.main.asyncAfter(deadline: .now() + 1.0) {
                let redirectedLink = self.webView.url
                print("HERE redirected url: ", redirectedLink) // this does not work
            }
        }

Solution

  • There are two problems with your attempt:

    1) You're reusing the same web view on every pass of the loop, and since nothing inside the loop waits for the web view to finish loading, each pass simply cancels the request started by the previous one.

    2) Even if you did block inside the loop, accessing the URL after a second won't work reliably since the navigation could easily take longer than that.

    What I would recommend doing is to continue using a single web view (to save resources) but to use its navigation delegate interface for resolving the URLs one by one.

    This is a crude example to give you a basic idea:

    import UIKit
    import WebKit
    
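    // Resolves a list of redirecting URLs one at a time, using a single
    // off-screen WKWebView and its navigation delegate callbacks.
    class RedirectResolver: NSObject, WKNavigationDelegate {
    
        private var urls: [URL]                  // URLs still waiting to be resolved
        private var resolvedURLs = [URL]()       // final URLs collected so far
        private let completion: ([URL]) -> Void
        private let webView = WKWebView()
    
        init(urls: [URL], completion: @escaping ([URL]) -> Void) {
            self.urls = urls
            self.completion = completion
            super.init()
            webView.navigationDelegate = self
        }
    
        func start() {
            resolveNext()
        }
    
        // Loads the next pending URL, or reports the results once none are left.
        // Note: popLast() takes URLs from the end of the array, so they are
        // resolved in reverse input order.
        private func resolveNext() {
            guard let url = urls.popLast() else {
                completion(resolvedURLs)
                return
            }
            let request = URLRequest(url: url)
            webView.load(request)
        }
    
        // Called when the navigation has finished; webView.url then holds the
        // URL the web view ended up on. Failures are not handled in this crude
        // example, so a failed navigation would stall the chain.
        func webView(_ webView: WKWebView, didFinish navigation: WKNavigation!) {
            if let finalURL = webView.url {
                resolvedURLs.append(finalURL)
            }
            resolveNext()
        }
    
    }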
    
    
    class ViewController: UIViewController {
    
        private var resolver: RedirectResolver!
    
        override func viewDidLoad() {
            super.viewDidLoad()
    
            resolver = RedirectResolver(
                urls: [URL(string: "https://news.google.com/articles/CAIiEDthIxbgofssGWTpXgeJXzwqGQgEKhAIACoHCAow2Nb3CjDivdcCMJ_d7gU?hl=en-US&gl=US&ceid=US%3Aen")!, URL(string: "https://news.google.com/articles/CAIiEP5m1nAOPt-LIA4IWMOdB3MqGQgEKhAIACoHCAowocv1CjCSptoCMPrTpgU?hl=en-US&gl=US&ceid=US%3Aen")!],
                completion: { urls in
                    print(urls)
                })
            resolver.start()
        }
    
    }
    

    This outputs the following resolved URLs:

    [https://amp.cnn.com/cnn/2020/04/09/politics/trump-coronavirus-tests/index.html, https://www.cnbc.com/amp/2020/04/10/asia-markets-coronavirus-china-inflation-data-currencies-in-focus.html]
    

    One other thing to note is that the redirection of these particular URLs seems to rely on JavaScript, which means you do need a web view. Otherwise, kicking off URLRequests manually and observing the responses would have been enough.
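
    For links that redirect with ordinary HTTP 3xx responses (which, as noted, does not seem to be the case for the Google News links), the URLRequest route could look roughly like the sketch below. It relies on URLSession following redirects by default, so the `url` of the final response is the redirect target. The names `ResolvedURLStore` and `resolveHTTPRedirects` are made up for this illustration, and using HEAD requests is an assumption: some servers reject or mishandle HEAD, in which case a normal GET would be needed.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking // URLSession lives in this module on Linux
#endif

/// Thread-safe store for the results (hypothetical helper).
/// URLSession invokes completion handlers on a background queue,
/// so writes to the dictionary are guarded by a lock.
final class ResolvedURLStore: @unchecked Sendable {
    private var storage = [URL: URL]()
    private let lock = NSLock()

    func set(_ finalURL: URL, for original: URL) {
        lock.lock(); defer { lock.unlock() }
        storage[original] = finalURL
    }

    var all: [URL: URL] {
        lock.lock(); defer { lock.unlock() }
        return storage
    }
}

/// Requests every URL in parallel and maps each original URL to the URL
/// of the final response. Only HTTP-level redirects are followed;
/// JavaScript-driven redirects are NOT resolved by this approach.
func resolveHTTPRedirects(for urls: [URL],
                          session: URLSession = .shared,
                          queue: DispatchQueue = .main,
                          completion: @escaping ([URL: URL]) -> Void) {
    let store = ResolvedURLStore()
    let group = DispatchGroup()

    for url in urls {
        var request = URLRequest(url: url)
        request.httpMethod = "HEAD" // assumption: headers suffice, skip the body

        group.enter()
        session.dataTask(with: request) { _, response, _ in
            // After redirects have been followed, response.url is the
            // URL of the last response in the chain.
            if let finalURL = response?.url {
                store.set(finalURL, for: url)
            }
            group.leave()
        }.resume()
    }

    // Runs once every request has either completed or failed.
    group.notify(queue: queue) {
        completion(store.all)
    }
}
```

    Calling `resolveHTTPRedirects(for: myURLs) { mapping in print(mapping) }` would then print a dictionary from each original URL to its final URL. URLs whose request fails are simply missing from the dictionary. Because the requests run in parallel and need no rendering, this is considerably cheaper than a web view whenever the redirects are plain HTTP ones.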