python regex web-scraping mechanize-python

Trying to match a regular expression on a website using Mechanize and Python


I'm trying to eventually populate a Google Sheet with data I'm scraping from Wikipedia. (I'll deal with the robots.txt file later; I'm just trying to figure out how to do this conceptually.) My code is below. I'm trying to read the page in as a string and then run a regexp search on it. My goal is to isolate the specs on the page and at least store them as values, but I'm having a problem with the search: it keeps coming up as 'did not find'.

Be gentle, I'm a noob. Thanks in advance for your help!

import mechanize
import re
import gspread


br = mechanize.Browser()

pagelist=["https://en.wikipedia.org/wiki/Tesla_Model_S"]

wheelbase = ''
length =''
width= ''
height =''





pages=len(pagelist)
i=0



br.open(pagelist[0])

page = br.response()
print page.read()

pageAsaString = str(page.read())



match = re.search('Wheelbase',pageAsaString)
if match:                      
    print 'found', match.group() 
else:
print 'did not find'

Solution

  • I get the page just fine. The first problem is that, as posted, your print 'did not find' line isn't indented under the else:. This matters in Python! With that line unindented, the script won't even run (you'd get an IndentationError). Bump it over 4 spaces:

    if match:                      
        print 'found', match.group() 
    else:
        print 'did not find'
    

    There's one other thing. I'm not familiar with Mechanize, but the response behaves like a file: calling read() on it consumes it. Once you've called read() in print page.read(), you've already read to the end of the page, so the second page.read() comes back empty and there's nothing left to assign to pageAsaString or for the regex to search. So you'll want to read the page once and save it to a variable first. Check out the Python documentation on file I/O for more on this.

    After fixing the indentation and removing print page.read(), everything appeared to work just fine.

    Since you're starting out, I highly recommend reading Dive Into Python. Good luck with your project!
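
The read-once fix described above can be sketched without touching the network. This is only an illustration under some assumptions: io.BytesIO stands in for the object returned by br.response(), the HTML snippet is made-up sample markup rather than the real Wikipedia page, and the names html and leftover are hypothetical.

```python
import io
import re

# Hypothetical stand-in for br.response(): a file-like stream that is
# consumed as it is read. The markup below is invented sample data.
page = io.BytesIO(b"<th>Wheelbase</th><td>sample value</td>")

# Read the stream exactly once and keep the result in a variable.
html = page.read().decode("utf-8")

# A second read() returns an empty bytes object: the stream is exhausted.
leftover = page.read()

# Search the saved copy, not the already-consumed stream.
match = re.search(r"Wheelbase", html)
if match:
    print("found %s" % match.group())
else:
    print("did not find")
```

With the real page you would replace the BytesIO line with page = br.response() after br.open(...), and print html instead of calling page.read() a second time.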