
Objective-C: Using NSScanner to obtain <body> from HTML


I am trying to create an iOS app that simply extracts the <body> section of a web page.

I have the code working to connect to the URL and store the HTML in an NSString.

I have tried this, but I am just getting null strings for my result:

    NSScanner* newScanner = [NSScanner scannerWithString:htmlData];
    // Create a new scanner and give it the html data to parse.

    while (![newScanner isAtEnd])
    {
        [newScanner scanUpToString:@"<body>" intoString:NULL];
        // Scan until the <body> tag is found

        [newScanner scanUpToString:@"</body>" intoString:&bodyText];
        // Everything up to the end tag will get placed into the memory address of the result string

    }

I have tried an alternative way...

    NSScanner* newScanner = [NSScanner scannerWithString:htmlData];
    // Create a new scanner and give it the html data to parse.

    while (![newScanner isAtEnd])
    {
        [newScanner scanUpToString:@"<body" intoString:NULL];
        // Scan until the start of the opening <body tag is found

        [newScanner scanUpToString:@">" intoString:NULL];
        // Go to end of opening <body> tag

        [newScanner scanUpToString:@"</body>" intoString:&bodyText];
        // Everything up to the end tag will get placed into the memory address of the result string

    }

This second way returns a string which starts with ><script... etc.

To be honest, I don't have a good URL to test this with, and I think it may be easier with some help on removing the tags within the body too (like <p></p>).

Any help would be very much appreciated.


Solution

  • I don't know why your first method didn't work. I assume you defined bodyText before that snippet. This code worked fine for me:

    - (void)viewDidLoad {
        [super viewDidLoad];
        NSString *htmlData = @"This is some stuff before <body> this is the body </body> with some more stuff";
        NSScanner* newScanner = [NSScanner scannerWithString:htmlData];
        NSString *bodyText;
        while (![newScanner isAtEnd]) {
            [newScanner scanUpToString:@"<body>" intoString:NULL];
            [newScanner scanString:@"<body>" intoString:NULL];
            [newScanner scanUpToString:@"</body>" intoString:&bodyText];
        }
        NSLog(@"%@",bodyText); // 2015-01-28 15:58:00.360 ScanningOfHTMLProblem[1373:661934] this is the body 
    }
    

    Notice that I added a call to scanString:intoString: to get past the first "<body>".
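
    The same idea can be extended to the second attempt in the question, where the opening tag carries attributes (for example <body class="main">). scanUpToString:intoString: stops just before the string it looks for, so each marker has to be stepped over with scanString:intoString:, otherwise the result starts with ">". What follows is only a minimal sketch, not a real HTML parser: BodyFromHTML and StripTags are hypothetical helper names that do not appear in the original code. It assumes the page has a single <body> element and no ">" inside attribute values, and the match on "<body" is a plain prefix match.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text between the opening <body ...> tag
// and </body>, or nil if either tag is missing.
static NSString *BodyFromHTML(NSString *html) {
    NSScanner *scanner = [NSScanner scannerWithString:html];
    // By default the scanner skips whitespace before each scan; turn that
    // off so the body text is returned exactly as it appears.
    scanner.charactersToBeSkipped = nil;

    NSString *body = nil;
    // Move to the start of the opening tag, with or without attributes.
    [scanner scanUpToString:@"<body" intoString:NULL];
    if (![scanner scanString:@"<body" intoString:NULL]) return nil;
    // Skip any attributes, then step past the ">" that ends the opening tag.
    [scanner scanUpToString:@">" intoString:NULL];
    if (![scanner scanString:@">" intoString:NULL]) return nil;
    // Capture everything before the closing tag.
    [scanner scanUpToString:@"</body>" intoString:&body];
    if (![scanner scanString:@"</body>" intoString:NULL]) return nil;
    return body ?: @""; // an empty body leaves `body` unset
}

// Hypothetical helper: drops anything that looks like <...> and keeps the
// text in between. Entities such as &amp; are left untouched.
static NSString *StripTags(NSString *fragment) {
    NSScanner *scanner = [NSScanner scannerWithString:fragment];
    scanner.charactersToBeSkipped = nil;
    NSMutableString *text = [NSMutableString string];
    while (!scanner.isAtEnd) {
        NSString *chunk = nil;
        // Collect the text in front of the next tag, if there is any.
        if ([scanner scanUpToString:@"<" intoString:&chunk]) {
            [text appendString:chunk];
        }
        // Step over the tag itself: "<", its contents, and the closing ">".
        if ([scanner scanString:@"<" intoString:NULL]) {
            [scanner scanUpToString:@">" intoString:NULL];
            [scanner scanString:@">" intoString:NULL];
        }
    }
    return text;
}
```

    Calling StripTags(BodyFromHTML(htmlData)) would then give the body text with the markup removed. Checking for nil first is worthwhile, since BodyFromHTML returns nil when no <body> element is found. For arbitrary real-world pages a proper HTML parser is likely to be more robust than string scanning.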