python google-scholar

Scrape Google Scholar Security Page


I have a string like this:

url = 'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'

I wish to convert it to this:

converted_url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=en&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10'

I have tried this:

converted_url = url.decode('utf-8')

However, this error is thrown:

AttributeError: 'str' object has no attribute 'decode'

Solution

  • decode converts bytes into a string. Your url is already a string (str), not bytes, so in Python 3 it has no decode method.

    You can use encode to convert this string into bytes, and then decode those bytes with the unicode_escape codec to get the correct string.

    (I use the r prefix to simulate text with this problem. Without the prefix, Python interprets the \x escapes itself and the url doesn't need to be converted.)

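    # raw string: every \x3d and \x26 stays as literal backslash text
    url = r'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'
    print(url)
    
    # str -> bytes, then let the unicode_escape codec interpret the \x sequences
    url = url.encode('utf-8').decode('unicode_escape')
    print(url)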
    

    result:

    http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10
    
    http://scholar.google.pl/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10
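The decoded result still differs from the converted_url in the question: it uses http, keeps hl=pl, and still carries oe=Latin2. A minimal sketch of one way to close that gap with the standard library follows. The helper names unescape_hex and normalize_scholar_url are hypothetical, and the rules (force https, set hl to en, drop oe) are assumptions read off the question's desired output, not anything Google Scholar documents.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def unescape_hex(text):
    # Replace each literal backslash-x-HH sequence with the character it encodes.
    # Unlike the unicode_escape codec, this leaves any non-ASCII text untouched.
    return re.sub(r'\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)),
                  text)


def normalize_scholar_url(raw_url):
    # Hypothetical helper: the rewrite rules below are assumptions taken
    # from the desired URL shown in the question.
    parts = urlsplit(unescape_hex(raw_url))
    params = []
    for key, value in parse_qsl(parts.query):
        if key == 'oe':          # drop the output-encoding parameter
            continue
        if key == 'hl':          # switch the interface language to English
            value = 'en'
        params.append((key, value))
    # safe=':' keeps "label:security" readable instead of "label%3Asecurity"
    query = urlencode(params, safe=':')
    return urlunsplit(('https', parts.netloc, parts.path, query, parts.fragment))


raw = r'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'
print(normalize_scholar_url(raw))
```

Parsing the query into key/value pairs avoids fragile string replacement, and parameter order is preserved because parse_qsl returns a list.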
    

    BTW: first check print(url). Maybe you already have the correct url and are only using the wrong method to display it. When you evaluate an expression in the Python shell without print(), the shell shows repr() of the result, which displays some characters as escape codes instead of the characters themselves.
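To make that check concrete, here is a small sketch comparing print() and repr() on a string that really contains backslash sequences and on one that does not. The helper has_literal_escapes is a hypothetical name introduced only for this illustration.

```python
raw_text = r'hl\x3dpl'    # raw literal: backslash, x, 3, d are four separate characters
plain_text = 'hl\x3dpl'   # normal literal: Python already turned \x3d into '='

print(plain_text)         # hl=pl
print(repr(plain_text))   # 'hl=pl'
print(raw_text)           # hl\x3dpl
print(repr(raw_text))     # 'hl\\x3dpl'  (repr doubles the literal backslash)


def has_literal_escapes(text):
    # Hypothetical helper: True only if the string really contains
    # a backslash followed by "x", i.e. it still needs converting.
    return '\\x' in text


print(has_literal_escapes(raw_text))    # True
print(has_literal_escapes(plain_text))  # False
```

If print(url) shows a single backslash before x3d, the escapes are real text in the string and the conversion is needed; if it shows = and &, the string was fine all along.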