Ok, I'm going to start by showing my code:
import requests
import json
import csv
import pandas as pd

with open('AcoesURLJsonCompleta.csv', newline='') as csvfile:
    urlreader = csv.reader(csvfile, delimiter=',')
    for obj_id in urlreader:
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
        jsonData = requests.get(row, headers=headers).json()

mapper = (
    ('Ticker', 'ric'),
    ('Beta', 'beta'),
    ('DY', 'current_dividend_yield_ttm'),
    ('VOL', 'share_volume_3m'),
    ('P/L', 'pe_normalized_annual'),
    ('Cresc5A', 'eps_growth_5y'),
    ('LPA', 'eps_normalized_annual'),
    ('VPA', 'book_value_share_quarterly'),
    ('LAST', 'last')
)

data = {}
for dataKey, jsonDataKey in mapper:
    d = jsonData.get(jsonDataKey, '')
    try:
        flt_d = float(d)
    except ValueError:
        d = ''
    finally:
        data[dataKey] = [d]

table = pd.DataFrame(data, columns=['Ticker', 'Beta', 'DY', 'VOL', 'P/L', 'Cresc5A', 'LPA', 'VPA', 'Last'])
table.index = table.index + 1
table.to_csv('CompleteData.csv', sep=',', encoding='utf-8', index=False)
print(table)
Ok, so let's get started with:

for rows in Urls

Is this correct? I want to loop through the URLs stored in my CSV file, but I don't know if I'm using split and strip correctly.

Also, when a jsonData request returns NaN or null or doesn't find anything, how should I write my code so that it skips to the next URL and appends "" (nothing) when this happens? The output for this entire code is:

line 25, in <module>
    Beta = jsonData['beta']
KeyError: 'beta'
Thank you
Updated Code
I have taken the few URLs you supplied, run the following code against them, and printed out the results. This version uses multiple threads to fetch the URLs and a requests Session, which greatly speeds up processing.
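Stripped of the error handling and retry logic, the core pattern is roughly this (a minimal sketch; fetch and the URLs are illustrative placeholders, not part of the actual program):

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(session, url):
    # a single shared Session reuses TCP connections across all requests
    return session.get(url, timeout=10).json()

urls = ['https://example.com/api/AAAA3.sa', 'https://example.com/api/BBBB4.sa']  # placeholders
with requests.Session() as session, ThreadPoolExecutor(max_workers=8) as executor:
    # the thread pool overlaps the network waits of the concurrent GETs
    results = list(executor.map(lambda u: fetch(session, u), urls))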
There is a constant near the top of the code, NUMBER_OF_CONCURRENT_URL_REQUESTS, that determines the number of concurrent URL get requests that will be made. I have tried various values from 8 to 30. This is what I have learned (or appears to be true):

Regardless of the value of NUMBER_OF_CONCURRENT_URL_REQUESTS, if you run the program twice in quick succession, you get the same results. It would appear that the server is caching request results for some period of time.

The larger the value of NUMBER_OF_CONCURRENT_URL_REQUESTS, the faster the program runs. There may be some value so large that the server gets upset and thinks you are attempting a denial-of-service attack; I don't see any reason to make this value larger than 30.

Is there a correlation between the value of NUMBER_OF_CONCURRENT_URL_REQUESTS and the likelihood of missing data? I cannot say for certain, but that appears to be the case, and it makes no sense to me. You can try different values and see for yourself one way or another.

The code:
import csv, requests, pandas as pd
from decimal import Decimal, DecimalException
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from time import sleep

NUMBER_OF_CONCURRENT_URL_REQUESTS = 8

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}

def request_getter(session, url):
    ric = url.split('/')[-1]  # in case the results do not contain the 'ric' key
    # retry with an increasing back-off (values in milliseconds) while the
    # server keeps answering with a 'retry' response
    for t in (0, 1000, 2000, 4000, 4000):
        if t:
            sleep(t / 1000)
            print(f"Retrying request '{ric}' ...", flush=True)
        data = session.get(url, headers=headers).json()
        if 'retry' not in data:
            break
    return ric, data
mapper = (
    ('Ticker', 'ric'),
    ('Beta', 'beta'),
    ('DY', 'current_dividend_yield_ttm'),
    ('VOL', 'share_volume_3m'),
    ('P/L', 'pe_normalized_annual'),
    ('Cresc5A', 'eps_growth_5y'),
    ('LPA', 'eps_normalized_annual'),
    ('VPA', 'book_value_share_quarterly'),
    ('LAST', 'last')
)
data = defaultdict(list)

with open('AcoesURLJsonCompleta.csv', newline='') as csvfile:
    urlreader = csv.reader(csvfile, delimiter=',')
    # use a thread pool to overlap the requests and a requests Session for even more performance
    with ThreadPoolExecutor(max_workers=NUMBER_OF_CONCURRENT_URL_REQUESTS) as executor, requests.Session() as session:
        request_getter_with_session = partial(request_getter, session)
        for ric, results in executor.map(request_getter_with_session, (row[0] for row in urlreader)):
            if 'market_data' not in results:
                print(f"Missing 'market_data' key for request '{ric}'", flush=True)
                for k, v in results.items():
                    print(f'    {repr(k)} -> {repr(v)}', flush=True)
                print(flush=True)
                continue
            market_data = results['market_data']
            if 'ric' not in market_data:
                # see if any of the mapper keys are present:
                found = False
                for _, jsonDataKey in mapper:
                    if jsonDataKey in market_data:
                        found = True
                        break
                if not found:
                    print(f"Request '{ric}' has nothing recognizable in market_data:", flush=True)
                    for k, v in market_data.items():
                        print(f'    {repr(k)} -> {repr(v)}', flush=True)
                    print(flush=True)
                    continue
                # We have at least one data value present
                print(f"Results missing 'ric' key; inferring 'ric' value '{ric}' from request URL.", flush=True)
                market_data['ric'] = ric
            for dataKey, jsonDataKey in mapper:  # for example, 'Ticker', 'ric'
                d = market_data.get(jsonDataKey)
                if d is None:
                    print(f"Data missing for request = '{ric}', key = '{jsonDataKey}'", flush=True)
                    d = '' if jsonDataKey == 'ric' else Decimal('NaN')
                else:
                    try:
                        if jsonDataKey != 'ric':
                            d = Decimal(d)
                    except DecimalException:
                        print(f"Bad value for '{jsonDataKey}': {repr(d)}", flush=True)
                        d = Decimal('NaN')  # the Decimal class has its own NaN value
                data[dataKey].append(d)  # add to data

table = pd.DataFrame(data)
table.index = table.index + 1
table.to_csv('CompleteData.csv', sep=',', encoding='utf-8', index=False)
print(table)
"""
# to read back table:
table2 = pd.read_csv('CompleteData.csv', sep=',', encoding='utf-8', converters={
'Ticker': str,
'Beta': Decimal,
'DY': Decimal,
'VOL': Decimal,
'P/L': Decimal,
'Cresc5A': Decimal,
'LPA': Decimal,
'VPA': Decimal,
'LAST': Decimal
})
print(table2)
"""
Prints:
Missing 'market_data' key for request CPLE6.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Missing 'market_data' key for request EQMA3B.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Data missing for ric GNDI3.sa, key beta
Data missing for ric GNDI3.sa, key current_dividend_yield_ttm
Data missing for ric GNDI3.sa, key share_volume_3m
Data missing for ric GNDI3.sa, key pe_normalized_annual
Data missing for ric GNDI3.sa, key eps_growth_5y
Data missing for ric GNDI3.sa, key eps_normalized_annual
Data missing for ric GNDI3.sa, key book_value_share_quarterly
Missing 'market_data' key for request MDNE3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Missing 'market_data' key for request MMXM11.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Missing 'market_data' key for request PCAR3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Results missing ric key; inferring ric value from request URL.
Data missing for ric RAIL3.sa, key last
Results missing ric key; inferring ric value from request URL.
Data missing for ric SANB4.sa, key last
Missing 'market_data' key for request TIMP3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Missing 'market_data' key for request VIVT3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Ticker Beta DY VOL P/L Cresc5A LPA VPA LAST
1 AALR3.sa 1.04339 0.80591 11.00223 26.44449 -99999.99000 0.39668 10.83966 10.490000
2 ABCB4.sa 1.20526 7.34780 18.61900 5.78866 5.42894 2.46862 18.87782 14.290000
3 ABEV3.sa 0.46311 4.32628 688.21043 15.04597 -0.71223 0.75369 3.89563 11.340000
4 ADHM3.sa 1.69780 0.00000 2.36460 -99999.99000 -99999.99000 -0.65331 -2.61497 2.480000
5 AGRO3.sa 0.35568 4.53332 2.54323 41.17127 -99999.99000 0.49792 17.47838 20.500000
.. ... ... ... ... ... ... ... ... ...
255 WEGE3.sa 0.50580 1.02429 165.72543 50.11481 17.06485 0.79697 4.59658 39.940000
256 WHRL3.sa 0.59263 8.86991 1.24990 12.72584 0.65648 0.50920 2.00868 6.700000
257 WHRL4.sa 0.59263 8.86991 1.24990 12.72584 0.65648 0.50920 2.00868 6.480000
258 WIZS3.sa 0.76719 12.18673 19.00407 6.67135 21.23109 1.36704 1.16978 9.120000
259 YDUQ3.sa 1.42218 1.68099 94.00410 13.83419 9.13751 2.19384 10.31845 30.350000
[259 rows x 9 columns]
Next run:
Missing 'market_data' key for request CPLE6.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Missing 'market_data' key for request EQMA3B.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Missing 'market_data' key for request MDNE3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Missing 'market_data' key for request MMXM11.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Missing 'market_data' key for request PCAR3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Missing 'market_data' key for request TIMP3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Missing 'market_data' key for request VIVT3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
Ticker Beta DY VOL P/L Cresc5A LPA VPA LAST
1 AALR3.sa 1.04339 0.80591 11.00223 26.44449 -99999.99000 0.39668 10.83966 10.490000
2 ABCB4.sa 1.20526 7.34780 18.61900 5.78866 5.42894 2.46862 18.87782 14.290000
3 ABEV3.sa 0.46311 4.32628 688.21043 15.04597 -0.71223 0.75369 3.89563 11.340000
4 ADHM3.sa 1.69780 0.00000 2.36460 -99999.99000 -99999.99000 -0.65331 -2.61497 2.480000
5 AGRO3.sa 0.35568 4.53332 2.54323 41.17127 -99999.99000 0.49792 17.47838 20.500000
.. ... ... ... ... ... ... ... ... ...
255 WEGE3.sa 0.50580 1.02429 165.72543 50.11481 17.06485 0.79697 4.59658 39.940000
256 WHRL3.sa 0.59263 8.86991 1.24990 12.72584 0.65648 0.50920 2.00868 6.700000
257 WHRL4.sa 0.59263 8.86991 1.24990 12.72584 0.65648 0.50920 2.00868 6.480000
258 WIZS3.sa 0.76719 12.18673 19.00407 6.67135 21.23109 1.36704 1.16978 9.120000
259 YDUQ3.sa 1.42218 1.68099 94.00410 13.83419 9.13751 2.19384 10.31845 30.350000
[259 rows x 9 columns]
Discussion
The code has been made more complex by using threads and a requests Session object, but the complexity is necessary to greatly reduce the running time of the program.
To understand the code, you need to understand ThreadPoolExecutor, the built-in map function (the ThreadPoolExecutor.map method is a variation on it that assigns a thread to execute each function call), and functools.partial. partial is needed because map expects its function argument to take a single argument, but we need to call request_getter with two arguments: a requests Session object, which never varies, and a URL. partial lets us transform a function that takes two arguments into a function that takes one argument, with the other argument supplied automatically. For example:
def foo(x, y):
    return x + y

foo7 = partial(foo, 7)  # the first argument to foo will now always be 7

foo7(9)  # equivalent to foo(7, 9); returns 16
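Applied to the program above, this same transformation is what makes executor.map usable. Here is a self-contained sketch (the request_getter body is a stand-in for the real one, which issues session.get and parses the JSON):

from functools import partial
from concurrent.futures import ThreadPoolExecutor

def request_getter(session, url):
    # stand-in body for illustration only
    return f'{session} fetched {url}'

with ThreadPoolExecutor(max_workers=2) as executor:
    one_arg_getter = partial(request_getter, 'shared-session')  # first argument bound
    # executor.map can now call it with a single URL argument per thread
    print(list(executor.map(one_arg_getter, ['url1', 'url2'])))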
To read back the csv file:

from decimal import Decimal
import pandas as pd

table = pd.read_csv('CompleteData.csv', sep=',', encoding='utf-8', converters={
    'Ticker': str,
    'Beta': Decimal,
    'DY': Decimal,
    'VOL': Decimal,
    'P/L': Decimal,
    'Cresc5A': Decimal,
    'LPA': Decimal,
    'VPA': Decimal,
    'LAST': Decimal
})
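Without the converters argument, pandas would parse the numeric columns as float64; the Decimal converters preserve the values exactly as they were written out, including the Decimal('NaN') placeholders. A quick check of the round-trip (a sketch; it just inspects the first value):

print(type(table['Beta'].iloc[0]))  # <class 'decimal.Decimal'>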