Search code examples
pythonselenium-webdrivergoogle-sheetsweb-scrapingpython-requests

How to get data from public google sheet and more than 100 rows


im currently trying to scrape a google sheet. But I have serious troubles because I cant just use requests to get the page data because only the first 100 rows are loaded. So I tried to use selenium but google has pretty strong anti scrape techniques. I cant get good data out of it. If someone knows a different aproach or a way to fix the selenium method let me know.


Solution

  • You can transform the URL of the google sheet (to have export in Json format). Then use json/pandas to load the data into a Dataframe:

    import json
    import requests
    import pandas as pd
    
    
    # transform the URL to this:
    url = 'https://docs.google.com/spreadsheets/d/1DAQ__r9RNx4mHgcDoXvzda9-0d_yeP7hqXdRUZ67EtM/gviz/tq?tqx=out:json'
    
    data = requests.get(url).text
    data = data[data.index('{'):-2]
    data = json.loads(data)['table']
    # print(json.dumps(data['table'], indent=4))
    
    df = pd.DataFrame([[c['v'] if c else None for c in r['c']] for r in data['rows']], columns=None)[range(4)]
    df.columns = ['Item', 'Link', 'Price (Yuan)', 'Price (USD)']
    
    print(df.head())
    print()
    print(len(df))
    

    Prints:

                                  Item                                               Link Price (Yuan) Price (USD)
    0               NIKE VAPORMAX 2021  https://pandabuy.allapp.link/clvm6i8gpf6qoog017eg         ¥178         $27
    1  OFF WHITE DUNKS\n(29 COLORWAYS)  https://pandabuy.allapp.link/clvm6k0gpf6qoog017fg         ¥160         $25
    2  OFF WHITE DUNKS\nUNIVERSITY RED  https://pandabuy.allapp.link/clvm6nggpf6qoog017gg         ¥400         $63
    3      OFF WHITE DUNKS\nPINE GREEN  https://pandabuy.allapp.link/clvm6q7m8nglvnuo8urg         ¥400         $63
    4           OFF WHITE LOT 50 DUNKS  https://pandabuy.allapp.link/clvm6s90b4mhqi3p2010         ¥180         $28
    
    1319
    

    EDIT: To load images - use and load the sheet from preview URL

    import requests
    from io import StringIO
    from bs4 import BeautifulSoup
    
    import pandas as pd
    
    
    url = 'https://docs.google.com/spreadsheets/d/1DAQ__r9RNx4mHgcDoXvzda9-0d_yeP7hqXdRUZ67EtM/preview/sheet?gid=0'
    
    table = BeautifulSoup(requests.get(url).content, 'html.parser').table
    
    # convert all img links to text
    for img in table.select('td img[src]'):
        img.replace_with(img['src'])
    
    df = pd.read_html(StringIO(str(table)))[0]
    print(df.head(10))
    print(len(df))
    

    Prints:

       Unnamed: 0                          Unnamed: 1                                         Unnamed: 2                          Unnamed: 3                          Unnamed: 4                                                                                                                                                                                                                                      Unnamed: 5
    0           1                              Tiktok                                      Branded Goods        CTRL + F to browse for items        CTRL + F to browse for items                                                                                                                                                                                                                    CTRL + F to browse for items
    1           2  Sign up to be a pandabuy affiliate                 Sign up to be a pandabuy affiliate  Sign up to be a pandabuy affiliate  Sign up to be a pandabuy affiliate                                                                                                                                                                                                              Sign up to be a pandabuy affiliate
    2           3                               SHOES                                              SHOES                               SHOES                               SHOES                                                                                                                                                                                                                                           SHOES
    3           4                                ITEM                                               LINK                    PRICE (¥) (YUAN)                     PRICE ($) (USD)                                                                                                                                                                                                                                         PICTURE
    4           5             JORDAN 4 KX/KZ/K2 Batch  https://pandabuy.allapp.link/clvlhrggpf6qoog012sg                                ¥280                                 $41   https://lh7-rt.googleusercontent.com/sheetsz/AHOq17EttQoQ-f8s57IHnPjcgralFtsUA4x23UTYLye_lTx1K4Tfkwi3A2sg8KKrFlVZaQIxLcjOouPJn5FbrVB5NGe7k88hgYlk1WPuk-ZPIFocGJRQQMI7bknBYzTryoIVOgc-k0qMyYBVWf7hmhMiV4s=w243-h128?key=kbwyiYmsDs8YaHncN15rsw
    5           6      Best JORDAN 4 BLACK CATS Batch  https://pandabuy.allapp.link/clvlhvggpf6qoog012t0                                ¥450                                 $67   https://lh7-rt.googleusercontent.com/sheetsz/AHOq17FEusJVTfIr5gAv8gMh3BJEsHN6zjjJaYejr_bWNBJMUyOZcaCUhUm2dQNS2EM9--8dZ8gHaKyGj0G3k4hAwSkba21x9oO1u_QiOXnGi7o94u-VQD2-PkjhS66MFjpBsKn4mh5v74m53-9K3c1a-cM=w243-h128?key=kbwyiYmsDs8YaHncN15rsw
    6           7                        ARCHEO DUNKS  https://pandabuy.allapp.link/clvlihggpf6qoog01340                                ¥351                                 $53  https://lh7-rt.googleusercontent.com/sheetsz/AHOq17H-O6FAFrpIO6CKerzyeArQYyOXdUdeRO_CddZpmgbAbaHJfPmtzf64NheeENxvAf2mL1m88N3BF7jnPn_4POuNUXHCvkHkZFeHPqbiZSDvKv-j8Ch4e7hpCkzKGuOzyst_J9I1RdSItfoSJqhhNCvP=w243-h129?key=kbwyiYmsDs8YaHncN15rsw
    7           8                        BRAZIL DUNKS  https://pandabuy.allapp.link/clvljbnm8nglvnuo8r60                                ¥351                                 $53   https://lh7-rt.googleusercontent.com/sheetsz/AHOq17FI1GveGB7zouQhT0bo1OWrbEB26LDxsPMhfgF8zXXcZ_yGY1r3POWXs2oRFbFrKiyVSlUw9nLOC7D_WXnnaEigos-rqVIrgBhv9hymIXMf04ZOoZFTIbUD-G995q1bU9F9NzUzUDDDoc2HjRn8mF4=w243-h129?key=kbwyiYmsDs8YaHncN15rsw
    8           9              DARK MARINA BLUE DUNKS  https://pandabuy.allapp.link/clvljgggpf6qoog013e0                                ¥351                                 $53  https://lh7-rt.googleusercontent.com/sheetsz/AHOq17FpjYzZmQsX04TQk4Sb6ohGafMgEahkwa4NCgVGadHqulSaWxp5iW2MC3eI0mA0JSQq0xDgO8mpq1lymma_D3h-Ma6y8vUxsjIc8_n_RhIhwKCdOl7j-7WDBvYSxoZVax4Y99KPEy95gHy89rqhKXCb=w243-h129?key=kbwyiYmsDs8YaHncN15rsw
    9          10               UNIVERSITY BLUE DUNKS  https://pandabuy.allapp.link/clvlna8gpf6qoog01480                                ¥180                                 $26  https://lh7-rt.googleusercontent.com/sheetsz/AHOq17GDYMrm21e0rl4AOLz0204Wd80Xj5bjTfjwa1eHosvQRveP7thUWfs9qc2k95Mpln7Qghju1_uNR_2hLmOAa9EHPacupB0tKWKXy1nH7IFdV7uRcN5bVHSusMNfkMZz2KAq9xAiJhGRY_lfGeK4OqMf=w243-h129?key=kbwyiYmsDs8YaHncN15rsw
    1471