
Parsing fixed-format data embedded in HTML in Python


I am using Google's App Engine API

from google.appengine.api import urlfetch

to fetch a webpage. The result of

result = urlfetch.fetch("http://www.example.com/index.html")

gives me the HTML content as a string (in result.content). The problem is that the data I want to parse is not really in HTML form, so I don't think a Python HTML parser will work for me. I need to parse all of the plain text in the body of the HTML document. The only problem is that urlfetch returns the entire HTML document as a single string, with all newlines and extra spaces removed.

EDIT: Okay, I tried fetching a different URL, and apparently urlfetch does not strip the newlines; it was the original webpage I was trying to parse that served the HTML file that way. END EDIT

If the document is something like this:

<html><head></head><body>
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
...
A4A       288        AAA
</body></html>

result.content will be this, after urlfetch fetches it:

'<html><head></head><body>AAA 123 888 2008-10-30 ABCBBB 987     2009-01-02 JSE...A4A     288            AAA</body></html>'

Using an HTML parser will not help me with the data between the body tags, so I was going to use regular expressions to parse my data. But as you can see, the last part of one line gets combined with the first part of the next line, and I don't know how to split them. I tried

result.content.split('\n')

and

result.content.split('\r')

but in both cases the resulting list had just one element. I don't see any option in Google's urlfetch function to keep the newlines.

Any ideas how I can parse this data? Maybe I need to fetch it differently?

Thanks in advance!


Solution

  • I understand that the format of the document is the one you have posted. In that case, I agree that a parser like Beautiful Soup may not be a good solution.

    I assume that you are already getting the interesting data (between the BODY tags) with a regular expression like

    import re
    # result.content holds the HTML string; capture the text between the BODY tags
    data = re.findall(r'<body>([^<]*)</body>', result.content)[0]
    

    then, it should be as easy as:

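    record_len = 5  # placeholder: use the real fixed width of one record
    start = 0
    while start < len(data):
        # consecutive chunks, with no separator character to skip
        print(data[start:start + record_len])
        start += record_len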
    

    (Note: I did not check this code against boundary cases, and it may well fail on them. The width of 5 is only a placeholder, and the code is only here to show the general idea.)
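
    To make the idea concrete, here is a minimal sketch that assumes every record is exactly 26 characters wide, which is what the three sample records in the question suggest. The column offsets and the field names (name, first, second, date, code) are my own guesses for illustration, not anything the question specifies, and the html string stands in for result.content.

```python
import re

# Assumed layout, inferred from the sample in the question: 26-character
# records with fields at fixed column offsets. The field names are made up.
RECORD_LEN = 26
FIELDS = [
    ("name", 0, 3),
    ("first", 4, 7),
    ("second", 8, 11),
    ("date", 12, 22),
    ("code", 23, 26),
]


def extract_body(html):
    """Return the text between <body> and </body>, or '' if not found."""
    match = re.search(r"<body[^>]*>(.*?)</body>", html,
                      re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else ""


def split_records(data, record_len=RECORD_LEN):
    """Cut the string into consecutive fixed-width chunks."""
    return [data[i:i + record_len] for i in range(0, len(data), record_len)]


def parse_record(record):
    """Slice one record into a dict of stripped field values."""
    return dict((name, record[start:end].strip())
                for name, start, end in FIELDS)


# Stand-in for result.content, with the runs of spaces spelled out so that
# each record is exactly 26 characters long.
html = ("<html><head></head><body>"
        + "AAA 123 888 2008-10-30 ABC"
        + "BBB 987" + " " * 5 + "2009-01-02 JSE"
        + "A4A" + " " * 5 + "288" + " " * 12 + "AAA"
        + "</body></html>")

records = [parse_record(r) for r in split_records(extract_body(html))]
for rec in records:
    print(rec)
```

    Slicing by column position rather than splitting on whitespace matters here because some fields are blank: in the last record a whitespace split would give three tokens and no way to tell which columns they came from. If the real width or offsets differ, only RECORD_LEN and FIELDS should need to change.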