Search code examples
pythongoogle-sheetsweb-scraping

Scrape Embedded Google Sheet from HTML in Python


This one has been relatively tricky for me. I am trying to extract the embedded table sourced from google sheets in python.

Here is the link

I do not own the sheet but it is publicly available.

here is my code thus far, when I go to output the headers it is showing me "". Any help would be greatly appreciated. End goal is to convert this table into a pandas DF. Thanks guys

import lxml.html as lh
import pandas as pd

url = 'https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727'

page = requests.get(url)

doc = lh.fromstring(page.content)

tr_elements = doc.xpath('//tr')

col = []
i = 0

for t in tr_elements[0]:
    i +=1
    name = t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[])) 

Solution

  • Well if you would like to get the data into a DataFrame, you could load it directly:

    df = pd.read_html('https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727', 
                      header=1)[0]
    df.drop(columns='1', inplace=True)  # remove unnecessary index column called "1"
    

    This will give you:

                                   Target Ticker                   Acquirer  \
    0       Acacia Communications Inc Com   ACIA      Cisco Systems Inc Com   
    1  Advanced Disposal Services Inc Com   ADSW   Waste Management Inc Com   
    2                    Allergan Plc Com    AGN             Abbvie Inc Com   
    3           Ak Steel Holding Corp Com    AKS   Cleveland Cliffs Inc Com   
    4      Td Ameritrade Holding Corp Com   AMTD  Schwab (Charles) Corp Com   
    
      Ticker.1 Current Price Take Over Price Price Diff % Diff Date Announced  \
    0     CSCO        $68.79          $70.00      $1.21  1.76%       7/9/2019   
    1       WM        $32.93          $33.15      $0.22  0.67%      4/15/2019   
    2     ABBV       $197.05         $200.22      $3.17  1.61%      6/25/2019   
    3      CLF         $2.98           $3.02      $0.04  1.34%      12/3/2019   
    4     SCHW        $49.31          $51.27      $1.96  3.97%     11/25/2019   
    
      Deal Type  
    0      Cash  
    1      Cash  
    2       C&S  
    3     Stock  
    4     Stock  
    

    Note read_html returns a list. In this case there is only 1 DataFrame, so we can refer to the first and only index location [0]