Primary Goal
Do exactly what this page does: textise.net
Secondary Goal
Provide a reader-friendly version of the web page, like Reader View in Safari.
The Hard Way
I wrote a custom WKWebView class with a custom navigation delegate that implements this function to get the HTML code:
- (void)getHTMLCodeWithCompletionHandler:(void (^)(NSString *htmlCode))completionHandler
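The post does not show how this method is implemented. A minimal sketch of one common way to do it is to ask the page for its serialized DOM through evaluateJavaScript:completionHandler:. The category name and the sn_ prefix are assumptions made for illustration, not the author's code.

```objc
#import <WebKit/WebKit.h>

// Hypothetical category; the original SNWebView implementation is not shown in the post.
@interface WKWebView (SNHTMLSource)
- (void)sn_getHTMLCodeWithCompletionHandler:(void (^)(NSString * _Nullable htmlCode))completionHandler;
@end

@implementation WKWebView (SNHTMLSource)

- (void)sn_getHTMLCodeWithCompletionHandler:(void (^)(NSString * _Nullable htmlCode))completionHandler {
    // outerHTML serializes the DOM as it exists right now, so it includes
    // changes made by scripts that have already run.
    [self evaluateJavaScript:@"document.documentElement.outerHTML"
           completionHandler:^(id _Nullable result, NSError * _Nullable error) {
        // Pass nil when the script failed or returned something other than a string.
        NSString *html = [result isKindOfClass:[NSString class]] ? (NSString *)result : nil;
        completionHandler(html);
    }];
}

@end
```

Calling this from the navigation delegate's did-finish callback only captures what has been rendered by that point; content loaded later by scripts would need another call.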
I use the HTMLKit library, which lets me parse the HTML code and search through the DOM. This is how it all works:
#pragma mark - SNWebViewNavigationDelegate

- (void)webViewDidFinishNavigation:(SNWebView *)webView {
    [webView getHTMLCodeWithCompletionHandler:^(NSString *htmlCode) {
        HTMLParser *parser = [[HTMLParser alloc] initWithString:htmlCode];
        HTMLDocument *document = [parser parseDocument];
        // ...
    }];
}
I'm using this function to walk the child and sibling elements (from this list) that might contain text. Unfortunately, this doesn't always work. On many sites, the text is nested deep in structures I can't reach, or it only appears after scripts have run.
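The traversal described above could look roughly like the following sketch. It uses HTMLKit's querySelectorAll: and textContent. The helper name and the selector list are assumptions made for illustration; they are not the list linked in the post.

```objc
#import <HTMLKit/HTMLKit.h>

// Hypothetical helper: collects the text of common text-bearing elements.
// The selector list below is an assumption, not the author's list.
static NSString *SNTextFromHTML(NSString *htmlCode) {
    HTMLDocument *document = [HTMLDocument documentWithString:htmlCode];
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray<NSString *> *blocks = [NSMutableArray array];

    for (HTMLElement *element in [document querySelectorAll:@"h1, h2, h3, p, li"]) {
        // textContent concatenates the text of the element and all its descendants.
        NSString *text = [element.textContent stringByTrimmingCharactersInSet:whitespace];
        if (text.length > 0) {
            [blocks addObject:text];
        }
    }
    // Separate blocks with a blank line so paragraphs stay distinguishable.
    return [blocks componentsJoinedByString:@"\n\n"];
}
```

This only sees the markup it is given, so it has the limitation described above: text that scripts insert later is missing. It also repeats text when a matched element is nested inside another matched element, for example a p inside an li.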
The Easy Way
Reverse-engineer a method Apple already uses for a different purpose. For example, there is a method to search a web page for text:
- (void)findString:(NSString *)string
    withConfiguration:(WKFindConfiguration *)configuration
    completionHandler:(void (^)(WKFindResult *result))completionHandler;
The WKFindResult you get back exposes only a BOOL, matchFound, that says whether the text was found. There is no way to get at the text the search was performed on.
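For reference, a call to this method might look like the sketch below. The helper name is hypothetical. The API is available from iOS 14 and macOS 11, hence the availability check.

```objc
#import <WebKit/WebKit.h>

// Hypothetical helper: logs whether the page currently contains `needle`.
static void SNLogWhetherPageContains(WKWebView *webView, NSString *needle) {
    if (@available(iOS 14.0, macOS 11.0, *)) {
        WKFindConfiguration *configuration = [[WKFindConfiguration alloc] init];
        configuration.caseSensitive = NO; // match regardless of case
        configuration.wraps = YES;        // continue from the top after reaching the end

        [webView findString:needle
          withConfiguration:configuration
          completionHandler:^(WKFindResult *result) {
            // matchFound is the only information the result carries.
            NSLog(@"Found \"%@\": %@", needle, result.matchFound ? @"YES" : @"NO");
        }];
    } else {
        NSLog(@"findString:withConfiguration:completionHandler: is unavailable on this OS version");
    }
}
```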
You could do something as simple as the following:
NSData *htmlData = [htmlString dataUsingEncoding:NSUTF8StringEncoding];
NSDictionary *options = @{
    NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
    NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)
};
NSAttributedString *attributedStringFromHTML = [[NSAttributedString alloc] initWithData:htmlData options:options documentAttributes:nil error:nil];
NSString *stringResult = [attributedStringFromHTML string];
But this has several downsides. First, the HTML-to-attributed-string conversion can be pretty slow. Second, depending on the minimum iOS version you need to support, it might have to run on the main thread. Finally, simply turning HTML into text is not ideal: the result would need some additional separation (newlines, spaces, etc.).
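One possible post-processing step for the plain string is to trim each line and drop the empty ones, so that blocks end up separated by exactly one newline. This is only a sketch; the helper name is hypothetical, and real pages would likely need more rules than this.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: trims every line and removes lines that end up empty.
static NSString *SNNormalizedText(NSString *rawText) {
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceCharacterSet];
    NSArray<NSString *> *rawLines =
        [rawText componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];

    NSMutableArray<NSString *> *lines = [NSMutableArray array];
    for (NSString *line in rawLines) {
        NSString *trimmed = [line stringByTrimmingCharactersInSet:whitespace];
        if (trimmed.length > 0) {
            [lines addObject:trimmed];
        }
    }
    return [lines componentsJoinedByString:@"\n"];
}
```

For example, SNNormalizedText(@"  Title \n\n\n Body text  ") returns @"Title\nBody text".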