I am trying to create an iOS app that simply extracts the <body> section of a web page.
I have the code working to connect to the URL and store the HTML in an NSString.
I have tried this, but I am just getting null strings for my result:
NSScanner* newScanner = [NSScanner scannerWithString:htmlData];
// Create a new scanner and give it the html data to parse.
while (![newScanner isAtEnd])
{
    [newScanner scanUpToString:@"<body>" intoString:NULL];
    // Scan until <body> tag is found
    [newScanner scanUpToString:@"</body>" intoString:&bodyText];
    // Everything up to the end tag will get placed into the memory address of the result string
}
I have tried an alternative way...
NSScanner* newScanner = [NSScanner scannerWithString:htmlData];
// Create a new scanner and give it the html data to parse.
while (![newScanner isAtEnd])
{
    [newScanner scanUpToString:@"<body" intoString:NULL];
    // Scan until the start of the opening <body> tag is found
    [newScanner scanUpToString:@">" intoString:NULL];
    // Go to end of opening <body> tag
    [newScanner scanUpToString:@"</body>" intoString:&bodyText];
    // Everything up to the end tag will get placed into the memory address of the result string
}
This second way returns a string that starts with ><script... etc.
To be honest, I don't have a good URL to test this with, and I think it would be easier with some help on removing the tags within the body too (like <p></p>).
Any help would be very much appreciated.
I don't know why your first method didn't work. I assume you defined bodyText before that snippet. This code worked fine for me:
- (void)viewDidLoad {
    [super viewDidLoad];
    NSString *htmlData = @"This is some stuff before <body> this is the body </body> with some more stuff";
    NSScanner* newScanner = [NSScanner scannerWithString:htmlData];
    NSString *bodyText;
    while (![newScanner isAtEnd]) {
        [newScanner scanUpToString:@"<body>" intoString:NULL];
        [newScanner scanString:@"<body>" intoString:NULL];
        [newScanner scanUpToString:@"</body>" intoString:&bodyText];
    }
    NSLog(@"%@",bodyText); // 2015-01-28 15:58:00.360 ScanningOfHTMLProblem[1373:661934] this is the body
}
Notice that I added a call to scanString:intoString: to get past the first "<body>".
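The same idea probably explains the leading ">" in your second attempt: scanUpToString:intoString: stops just before the string it is looking for, so the scanner is still sitting on the ">" when the final scan starts. Real pages also often have attributes on the opening tag (for example <body class="...">), in which case a literal "<body>" is never found. Below is a minimal sketch that handles both cases and also strips tags from the result. ExtractBody and StripTags are hypothetical helper names, not part of any framework; the sketch assumes ARC, a single well-formed body element, and no ">" inside attribute values.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text between the opening <body ...> tag
// and </body>, or nil if either tag cannot be found.
static NSString *ExtractBody(NSString *html) {
    NSScanner *scanner = [NSScanner scannerWithString:html];
    scanner.charactersToBeSkipped = nil;  // do not silently skip whitespace

    NSString *body = nil;
    // Move to the start of the opening tag, then step over "<body".
    [scanner scanUpToString:@"<body" intoString:NULL];
    if (![scanner scanString:@"<body" intoString:NULL]) return nil;
    // Skip any attributes, then step over the closing ">" of the opening tag.
    [scanner scanUpToString:@">" intoString:NULL];
    if (![scanner scanString:@">" intoString:NULL]) return nil;
    // Capture everything up to the end tag.
    [scanner scanUpToString:@"</body>" intoString:&body];
    if (![scanner scanString:@"</body>" intoString:NULL]) return nil;
    return body ?: @"";  // an empty body leaves `body` nil, so return @""
}

// Hypothetical helper: naive tag stripper that drops everything between
// "<" and ">" and keeps the text in between.
static NSString *StripTags(NSString *fragment) {
    NSScanner *scanner = [NSScanner scannerWithString:fragment];
    scanner.charactersToBeSkipped = nil;

    NSMutableString *text = [NSMutableString string];
    while (![scanner isAtEnd]) {
        NSString *chunk = nil;
        // Collect plain text up to the next tag, if there is any.
        if ([scanner scanUpToString:@"<" intoString:&chunk]) {
            [text appendString:chunk];
        }
        // Step over the tag itself: "<", its contents, and ">".
        if ([scanner scanString:@"<" intoString:NULL]) {
            [scanner scanUpToString:@">" intoString:NULL];
            [scanner scanString:@">" intoString:NULL];
        }
    }
    return text;
}
```

With the input @"<html><body class=\"x\"><p>Hello</p> world</body></html>", ExtractBody returns "<p>Hello</p> world", and passing that to StripTags gives "Hello world". Note that this stripper only removes the tags themselves, so the text inside <script> or <style> elements is kept; for anything beyond simple pages a real HTML parser is likely the safer choice.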