Tags: python, parsing, web, request, urllib2

Parsing a certain webpage in Python


I'm trying to split on every instance of "href" between two certain tags. To be specific, here's what I'm working with:

import urllib2
import re

req = urllib2.Request('http://tv1.alarab.com/')
response = urllib2.urlopen(req)
link = response.read()
target = re.findall(r'<div id="nav">(.*?)</div>', link, re.DOTALL)
for items in target:
    mypath = items.split(' href="/')[1].split('/')[0]
    print mypath

Here's what it prints out:

view-5553

It's only printing the first instance. On another website, I'm using the exact same approach and it prints every instance where it meets an "href".

Here's what I have for another website:

import urllib2
import re

req = urllib2.Request('http://www.shahidlive.co')
response = urllib2.urlopen(req)
link = response.read()
target = re.findall(r'<ul class="hidden-xs">(.*?)</ul>', link, re.DOTALL)
for items in target:
    mypath = items.split('href="')[1].split('">')[0]
    print mypath

Here's what this one prints out, which is basically what I want the first piece of code to print out:

/Album-1104708-1/
/Cat-134-1
/Cat-100-1
/Album-1104855-1/
/Cat-121-1

I tried running the debugger and it seems like the for loop is only iterating once for the first website. I'm not sure why or what's going on. Any help would be appreciated.


Solution

  • First of all, using regex to parse structured data like XML/HTML/JSON is an extremely bad idea - in your example, if there were a structure like:

    <div id="nav">
        <div>
            <span>whatever</span>
        </div>
        <a href="http://some.link/path">this is the link you want</a>
    </div>
    

    you'd get diddly-squat, as the regex would stop at the first </div> occurrence due to the non-greedy qualifier. On the other hand, if it were greedy, you'd match over the whole document, ignoring any other <div id="nav"> instances (duplicate ids should be illegal in HTML, but script-kiddies destroyed HTML a long time ago so now anything goes, but I digress...).
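
    You can see the non-greedy cutoff for yourself - a minimal, self-contained sketch using the structure above:

    import re

    html = '''<div id="nav">
        <div>
            <span>whatever</span>
        </div>
        <a href="http://some.link/path">this is the link you want</a>
    </div>'''

    # the non-greedy (.*?) stops at the FIRST </div>,
    # so the <a> tag you actually want never makes it into the match
    print re.findall(r'<div id="nav">(.*?)</div>', html, re.DOTALL)
    # ['\n    <div>\n        <span>whatever</span>\n    ']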

    However, in your particular case the issue is actually with your inner split logic - your regex returns a single match (there is only one <div id="nav"> on the page, so it captures everything up to the first </div> tag):

    <div id="nav">
    <ul id="navbar">
            <li  id="d5553"><a title="..." href="/view-5553/">...</a></li><li  id="d1"><a title="..." href="/view-1/">...</a></li><li  id="d295"><a title="..." href="/view-295/">...</a></li><li  id="d6181"><a title="..." href="/view-6181/">...</a></li><li  id="d297"><a title="..." href="/view-297/">...</a></li><li  id="d311"><a title="..." href="/view-311/">...</a></li><li id="d5807"><a title="" href="/view-5807/">...</a></li><li  id="d10"><a title="..." href="/view-10/">...</a></li><li  id="d313"><a title="..." href="/view-313/">...</a></li><li  id="d1951"><a title="..." href="/view-1951/">...</a></li><li  id="d299"><a title="..." href="/view-299/">...</a></li><li  id="d8"><a title="..." href="/view-8/">...</a></li><li  id="d4"><a title="..." href="/view-4/">...</a></li><li  id="d309"><a title="..." href="/view-309/">...</a></li><li  id="d5573"><a title="..." href="/view-5573/">...</a></li>        
    </ul>
    
    </div>
    

    (I've replaced the squiggly stuff with ... for the sake of readability)

    So, when you call your split() routine on it and index [1], you only pick up one value - the first view-5553. If you want to capture the rest of the href values in that block, you'll have to split on href="/ and iterate through the resulting list to pick up the individual entries (each ending at the next "), as sketched further below, or you can use a regex for this step as well:

    mypath = re.findall(r' href="/(.*?)/?"', items)
    # ['view-5553', 'view-1', 'view-295', 'view-6181', 'view-297', 'view-311', 'view-5807',
    # 'view-10', 'view-313', 'view-1951', 'view-299', 'view-8', 'view-4', 'view-309',
    # 'view-5573']
    

    (this is with my replacements, your actual code will give you the actual links).
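
    For completeness, here's what the split-and-iterate approach mentioned above might look like - a minimal sketch operating on the same items string:

    # skip everything before the first href, then cut each chunk
    # at the closing quote and drop the trailing slash
    mypaths = [chunk.split('"')[0].rstrip('/')
               for chunk in items.split(' href="/')[1:]]
    # ['view-5553', 'view-1', 'view-295', ...]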

    And just to mention it again - regex is not the right tool for HTML parsing; save yourself some trouble and go with at least BeautifulSoup.
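
    For example, the whole exercise could be done along these lines - a minimal sketch assuming BeautifulSoup 4 (bs4) is installed and the page keeps its current structure:

    import urllib2
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    html = urllib2.urlopen('http://tv1.alarab.com/').read()
    soup = BeautifulSoup(html, 'html.parser')

    # grab the nav <div> and walk its anchors - no regex involved
    nav = soup.find('div', id='nav')
    for a in nav.find_all('a', href=True):
        print a['href'].strip('/')  # e.g. view-5553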