Search code examples
pythonregexasciiencoding

How do you parse and match the keyword in search engine url using python re module?


Example from Google:

http://www.google.com.co/url?sa=t&rct=j&q=pedro%20gomez%20proyecto%20en%20la%20ciudad%20de%20valledupar&source=web&cd=10&ved=0CFsQFjAJ&url=http%3A%2F%2Fwww.21molino.com%2F1410%2F8911.html

or from Bing search:

http://www.bing.com/search?q=10%2F30+Sand&src=IE-SearchBox&FORM=IE8SRC

I want parse and match ?q= or q= keywords, using (?<=)? with the python re module. How can you can pass the multiple parameters in encode the ascii url to utf-8 so that it can be read?

Need some help here, thanks very much : )


Solution

  • Try this:

    [?&]q=([^&#]*)
    

    Or, better yet:

    import urlparse
    pr = urlparse.urlparse(url)
    qs = urlparse.parse_qs(pr.query)['q']
    

    The latter automatically decodes %-escapes, too.