I'm trying to get content from a URL and parse the response using BeautifulSoup.
When this URL is loaded, it retrieves my favourite watchlist items. The problem is that the site takes a couple of seconds to display the data in a table, so when I run urlopen(my_url) the response has no table and my parsing fails.
I'm trying to keep it simple as I'm still learning the language, so I would like to use the tools I've already set up in my code. Based on what I have, is there a way to wait, or to check when the content is ready, so that I can fetch the data (the table content)?
Here is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
from urllib.error import URLError, HTTPError
URL = 'url route goes here' # In compliance with SO rules I've removed the website path
def get_dom_from_url():
    try:
        u_client = ureq(URL)
        html = u_client.read()
        u_client.close()
    except HTTPError as e:
        print(f'There has been an HTTP ERROR: {e.code}')
    except URLError as e:
        # URLError has no .code attribute; the failure detail is in .reason
        print(f'There has been a problem reaching the URL. ERROR: {e.reason}')
    finally:
        print('''
        DOM loaded!
        ''')
    return html
dom = soup(get_dom_from_url(), 'html.parser')
# Crawl the dom object and get the table thead element
col_names = [col.text for col in dom.table.thead.find_all('th')]
col_names = col_names[1:-2]
col_names
This is the error message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-102-625de133b2e2> in <module>
----> 1 col_names = [col.text for col in dom.table.thead.find_all('th')]
2 col_names = col_names[1:-2]
3 col_names
AttributeError: 'NoneType' object has no attribute 'thead'
The code above works when I load the URL without the route, but I need the route because I have to store this data for an ETL pipeline I'm working on.
If there is no way to achieve this using only urllib, I would like to hear your suggestions.
Actually, you don't need to use Selenium here. The data is embedded in the page's source HTML, inside a <script> tag, as valid JSON. You just need to parse that:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
url = 'https://coinmarketcap.com/watchlist/60321ee5b01cab343e1e37d6/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
jsonStr = soup.find('script', {'id':'__NEXT_DATA__'}).text
jsonData = json.loads(jsonStr)
data = jsonData['props']['initialProps']['pageProps']['fetchedWatchlist']['cryptoCurrencies']
rows = []
for each in data:
    quotes_row = each.pop('quotes')[0]
    each.pop('tags')
    if 'platform' in each.keys():
        each.pop('platform')
    each.update(quotes_row)
    rows.append(each)
df = pd.DataFrame(rows)
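The extraction step above assumes the script tag and every key in the path exist. If the site changes its markup, it fails with the same kind of 'NoneType' error as in the question. Below is a minimal sketch of a more defensive version. It runs on a small inline HTML string so it can be tried offline; SAMPLE_HTML and the helper names extract_next_data and get_watchlist are hypothetical, not part of the site or of any library.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text; the real page is much larger.
SAMPLE_HTML = """
<html><body>
<div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"initialProps": {"pageProps": {"fetchedWatchlist":
  {"cryptoCurrencies": [{"id": 1, "symbol": "BTC"}]}}}}}
</script>
</body></html>
"""

# Key path taken from the answer's code above.
KEY_PATH = ('props', 'initialProps', 'pageProps',
            'fetchedWatchlist', 'cryptoCurrencies')


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ JSON, or None if the tag is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('script', {'id': '__NEXT_DATA__'})
    if tag is None or not tag.string:
        return None
    return json.loads(tag.string)


def get_watchlist(next_data):
    """Walk KEY_PATH; return an empty list if any level is missing."""
    node = next_data or {}
    for key in KEY_PATH:
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node


coins = get_watchlist(extract_next_data(SAMPLE_HTML))
print(coins)
```

With the real page, extract_next_data(response.text) would take the place of the soup.find(...) and json.loads(...) lines, and an empty result signals that the page layout no longer matches the expected path.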
Output:
print(df.to_string())
     id  name  symbol          slug  status  rank  marketPairCount  circulatingSupply   totalSupply     maxSupply               lastUpdated                 dateAdded         price     volume24h     marketCap  percentChange1h  percentChange24h  percentChange7d
0     1   USD     BTC       bitcoin  active     1             9717       1.863544e+07  1.863544e+07  2.100000e+07  2021-02-22T09:37:02.000Z  2013-04-28T00:00:00.000Z  55579.249971  5.656584e+10  1.035744e+12        -1.232746         -1.234765        16.978174
1  1027   USD     ETH      ethereum  active     2             5982       1.147732e+08  1.147732e+08           NaN  2021-02-22T09:37:02.000Z  2015-08-07T00:00:00.000Z   1855.072456  2.450605e+10  2.129125e+11        -1.373583         -4.104364         5.315240
2  1839   USD     BNB  binance-coin  active     3              469       1.545328e+08  1.705328e+08  1.705328e+08  2021-02-22T09:37:11.000Z  2017-07-25T00:00:00.000Z    272.095668  6.811884e+09  4.204770e+10        -2.381284          2.937286       109.533310
3   825   USD    USDT        tether  active     4            10829       3.445054e+10  3.570817e+10           NaN  2021-02-22T09:37:08.000Z  2015-02-25T00:00:00.000Z      0.999576  1.087710e+11  3.443593e+10        -0.061248         -0.023795        -0.074917
4  6636   USD     DOT  polkadot-new  active     5              145       9.103144e+08  1.045967e+09           NaN  2021-02-22T09:36:05.000Z  2020-08-19T00:00:00.000Z     37.503515  3.257901e+09  3.413999e+10        -1.327435         -2.635214        40.263648
5  2010   USD     ADA       cardano  active     6              231       3.111248e+10  4.500000e+10  4.500000e+10  2021-02-22T09:37:09.000Z  2017-10-01T00:00:00.000Z      1.040491  6.621492e+09  3.237226e+10        -1.594681         -7.316003        25.951127
6    52   USD     XRP           xrp  active     7              673       4.540403e+10  9.999083e+10  1.000000e+11  2021-02-22T09:38:03.000Z  2013-08-04T00:00:00.000Z      0.581321  1.102498e+10  2.639430e+10        -1.640063         11.286157         2.731301
7     2   USD     LTC      litecoin  active     8              754       6.653055e+07  6.653055e+07  8.400000e+07  2021-02-22T09:38:02.000Z  2013-04-28T00:00:00.000Z    216.783950  6.530638e+09  1.442276e+10        -2.134667         -3.477237         5.932102
8  1975   USD    LINK     chainlink  active     9              471       4.085096e+08  1.000000e+09  1.000000e+09  2021-02-22T09:37:11.000Z  2017-09-20T00:00:00.000Z     32.145503  1.885830e+09  1.313174e+10        -1.378857         -5.152372        -0.036835
9  1831   USD     BCH  bitcoin-cash  active    10              581       1.866177e+07  1.866177e+07  2.100000e+07  2021-02-22T09:37:07.000Z  2017-07-23T00:00:00.000Z    679.047253  5.800439e+09  1.267222e+10        -1.298651         -0.162108        -1.595937
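Note that the name column shows USD in every row. This appears to happen because each.update(quotes_row) overwrites the coin's own name with a name key from the quote entry. That the quote dictionary carries such a key is an inference from the output, not something confirmed against the site. Below is a minimal sketch of one way to flatten a record without losing the coin's name. The sample record, the flatten helper and the quote_ prefix are all hypothetical.

```python
# Hypothetical record shaped like one entry of `data`. The 'name' key inside
# the quote is an assumption inferred from the 'USD' values in the output.
sample = {
    'id': 1,
    'name': 'Bitcoin',
    'symbol': 'BTC',
    'tags': [],
    'quotes': [{'name': 'USD', 'price': 55579.249971}],
}

# Nested fields that the answer's loop drops before building the DataFrame.
DROPPED = ('quotes', 'tags', 'platform')


def flatten(coin, quote_prefix='quote_'):
    """Return a flat copy of coin; quote keys that collide get a prefix."""
    row = {k: v for k, v in coin.items() if k not in DROPPED}
    quotes = coin.get('quotes') or [{}]
    for key, value in quotes[0].items():
        # Prefix only the keys that would overwrite an existing field.
        target = (quote_prefix + key) if key in row else key
        row[target] = value
    return row


row = flatten(sample)
print(row)
```

Building rows with rows = [flatten(each) for each in data] and passing that list to pd.DataFrame would then keep both the coin name and the quote currency as separate columns. Unlike the loop above, it also leaves the original dictionaries unmodified.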