This one has been relatively tricky for me. I am trying to extract the embedded table sourced from google sheets in python.
Here is the link
I do not own the sheet but it is publicly available.
here is my code thus far, when I go to output the headers it is showing me "". Any help would be greatly appreciated. End goal is to convert this table into a pandas DF. Thanks guys
import lxml.html as lh
import pandas as pd
url = 'https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
col = []
i = 0
for t in tr_elements[0]:
i +=1
name = t.text_content()
print('%d:"%s"'%(i,name))
col.append((name,[]))
Well if you would like to get the data into a DataFrame, you could load it directly:
df = pd.read_html('https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727',
header=1)[0]
df.drop(columns='1', inplace=True) # remove unnecessary index column called "1"
This will give you:
Target Ticker Acquirer \
0 Acacia Communications Inc Com ACIA Cisco Systems Inc Com
1 Advanced Disposal Services Inc Com ADSW Waste Management Inc Com
2 Allergan Plc Com AGN Abbvie Inc Com
3 Ak Steel Holding Corp Com AKS Cleveland Cliffs Inc Com
4 Td Ameritrade Holding Corp Com AMTD Schwab (Charles) Corp Com
Ticker.1 Current Price Take Over Price Price Diff % Diff Date Announced \
0 CSCO $68.79 $70.00 $1.21 1.76% 7/9/2019
1 WM $32.93 $33.15 $0.22 0.67% 4/15/2019
2 ABBV $197.05 $200.22 $3.17 1.61% 6/25/2019
3 CLF $2.98 $3.02 $0.04 1.34% 12/3/2019
4 SCHW $49.31 $51.27 $1.96 3.97% 11/25/2019
Deal Type
0 Cash
1 Cash
2 C&S
3 Stock
4 Stock
Note read_html
returns a list. In this case there is only
1 DataFrame, so we can refer to the first and only index location [0]