Tags: javascript, html, objective-c, wkwebview

How to get all rendered text from a web page in a WKWebView?


Primary Goal

Do exactly what this page does: textise.net

Secondary Goal

Provide a reader-friendly version of the page, like Reader View in Safari.

The Hard Way

I wrote a custom WKWebView class with a custom navigation delegate. The class implements this method to get the HTML code:

- (void)getHTMLCodeWithCompletionHandler:(void (^)(NSString *htmlCode))completionHandler

I use the HTMLKit library to parse the HTML code and search through the DOM. This is how it all works:

#pragma mark - SNWebViewNavigationDelegate

- (void)webViewDidFinishNavigation:(SNWebView *)webView {
    // Fetch the page's HTML once navigation has finished.
    [webView getHTMLCodeWithCompletionHandler:^(NSString *htmlCode) {
        // Parse the HTML string into a DOM with HTMLKit.
        HTMLParser *parser = [[HTMLParser alloc] initWithString:htmlCode];
        HTMLDocument *document = [parser parseDocument];

        // ...
    }];
}

Inside this method I search the parsed document for child and sibling elements (from this list) that might contain text. Unfortunately, this doesn't always work: on many sites the text is nested deep in structures I have no access to, or is only produced by scripts that need to run first.
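To illustrate the deep-nesting problem: instead of checking only children and siblings of the parsed HTML, one option is to walk every descendant of the live DOM after the page's scripts have run. The following is a minimal JavaScript sketch under that assumption. The names collectText and SKIPPED_TAGS are hypothetical, and the list of skipped tags is a guess rather than a complete set. The function only relies on the DOM-style fields nodeType, nodeValue, tagName and childNodes.

```javascript
// Tags whose text is never rendered; an assumed, non-exhaustive list.
const SKIPPED_TAGS = new Set(["SCRIPT", "STYLE", "NOSCRIPT", "TEMPLATE"]);

// Recursively collect the non-empty text of every text node under `node`,
// at any depth. Whitespace inside each text node is collapsed to one space.
function collectText(node, out = []) {
  const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
  const TEXT_NODE = 3;    // same value as Node.TEXT_NODE

  if (node.nodeType === TEXT_NODE) {
    const text = (node.nodeValue || "").replace(/\s+/g, " ").trim();
    if (text) out.push(text);
    return out;
  }
  // Skip elements whose content is code or styling rather than page text.
  if (node.nodeType === ELEMENT_NODE && SKIPPED_TAGS.has(node.tagName)) {
    return out;
  }
  for (const child of Array.from(node.childNodes || [])) {
    collectText(child, out);
  }
  return out;
}
```

Inside a page this could be called as collectText(document.body).join("\n"), for example from a script string passed to WKWebView's evaluateJavaScript:completionHandler: once navigation has finished. The sketch does not look at CSS, so text hidden with display: none is still collected; the built-in document.body.innerText takes rendering into account and may be the simpler choice when that matters.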

The Easy Way

Reverse-engineer a method Apple already uses for a different purpose. For example, WKWebView has a method that searches a web page for text:

- (void)findString:(NSString *)string 
 withConfiguration:(WKFindConfiguration *)configuration 
 completionHandler:(void (^)(WKFindResult *result))completionHandler;

The WKFindResult you get back only carries a BOOL (matchFound) that says whether the text was found. There is no way to get the text the search was performed on.


Solution

  • You could do something as simple as the following:

    NSDictionary *options = @{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
                              NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)};
    NSAttributedString *attributedStringFromHTML =
        [[NSAttributedString alloc] initWithData:[htmlString dataUsingEncoding:NSUTF8StringEncoding]
                                         options:options documentAttributes:nil error:nil];

    NSString *stringResult = [attributedStringFromHTML string];
    

    But this has several downsides. First, converting HTML to an attributed string can be quite slow. Second, depending on the minimum iOS version you need to support, it might have to run on the main thread. Last, simply turning HTML into text is not ideal: the result would need some additional separation (newlines, spaces, etc.).
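The separation issue can be sketched with a different technique than the NSAttributedString conversion above: a JavaScript walk over the DOM that keeps inline text on one line and starts a new line at block-level elements. This is an illustration under assumptions, not a definitive implementation. The names textWithBreaks, BLOCK_TAGS and HIDDEN_TAGS are hypothetical, and both tag lists are incomplete guesses.

```javascript
// Assumed, non-exhaustive set of block-level tags that begin and end a line.
const BLOCK_TAGS = new Set([
  "P", "DIV", "SECTION", "ARTICLE", "UL", "OL", "LI", "TABLE", "TR",
  "BLOCKQUOTE", "PRE", "H1", "H2", "H3", "H4", "H5", "H6",
]);
// Assumed set of tags whose content is not page text.
const HIDDEN_TAGS = new Set(["SCRIPT", "STYLE", "NOSCRIPT"]);

// Return the text under an element node as one string, with a line break
// at every block boundary and every <br>.
function textWithBreaks(node) {
  const parts = [];
  (function walk(n) {
    if (n.nodeType === 3) { // text node: keep it inline
      parts.push((n.nodeValue || "").replace(/\s+/g, " "));
      return;
    }
    if (n.nodeType !== 1 || HIDDEN_TAGS.has(n.tagName)) return;
    if (n.tagName === "BR") {
      parts.push("\n");
      return;
    }
    const isBlock = BLOCK_TAGS.has(n.tagName);
    if (isBlock) parts.push("\n");
    for (const child of Array.from(n.childNodes || [])) walk(child);
    if (isBlock) parts.push("\n");
  })(node);

  return parts
    .join("")
    .replace(/ *\n */g, "\n") // drop spaces next to a line break
    .replace(/\n{2,}/g, "\n") // collapse empty lines
    .replace(/ {2,}/g, " ")   // collapse repeated spaces
    .trim();
}
```

In a WKWebView this could be evaluated as textWithBreaks(document.body) through evaluateJavaScript:completionHandler:, which would hand the resulting string back to the native side. It avoids the HTML importer entirely, at the cost of maintaining the tag lists by hand.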