Tags: javascript, python, regex, web-scraping, scraper

Extracting data from JavaScript (Python Scraper)


I'm currently using a combination of urllib2, pyquery, and json to scrape a site, and I now need to extract some data from JavaScript. One option would be to use a JavaScript engine (like V8), but that seems like overkill for what I need. I would use regular expressions, but the expression for this seems way too complex.

JavaScript:

(function(){DOM.appendContent(this, HTML("<html>"));;})

I need to extract the <html> part, but I'm not entirely sure how to do so. The <html> itself can contain basically every character under the sun, including double quotes, so [^"] won't work.

Any thoughts?


Solution

  • Why regex? The wrapper around <html> is fixed, so you know exactly how many characters to trim off the beginning (42) and the end (7). Can't you just slice the string?

    string[42:-7]
    

    As well as being quicker than a regex, slicing works whether or not the quotes inside <html> are escaped.
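
A slightly more defensive sketch of the same idea, assuming the wrapper is always exactly the one shown in the question. The names PREFIX, SUFFIX and extract_html are made up for illustration; they are not part of any library. Checking the wrapper before slicing means a change in the site's JavaScript raises an error instead of silently returning the wrong substring.

```python
# Assumed fixed wrapper, copied from the JavaScript in the question.
PREFIX = '(function(){DOM.appendContent(this, HTML("'   # 42 characters
SUFFIX = '"));;})'                                      # 7 characters


def extract_html(js):
    """Return whatever sits between PREFIX and SUFFIX (hypothetical helper)."""
    has_wrapper = (
        len(js) >= len(PREFIX) + len(SUFFIX)
        and js.startswith(PREFIX)
        and js.endswith(SUFFIX)
    )
    if not has_wrapper:
        raise ValueError("unexpected JavaScript wrapper")
    # Same result as js[42:-7], but without hard-coded offsets.
    return js[len(PREFIX):len(js) - len(SUFFIX)]


# Made-up sample payload containing escaped double quotes.
sample = PREFIX + '<div class=\\"a\\">hi</div>' + SUFFIX
print(extract_html(sample))
```

Note that the result is still the raw text of a JavaScript string literal, so any escape sequences inside it (such as \" above) are left as they are.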