Search code examples
pythonjsonloopscsvkeyerror

Loop not working properly and key output wrong


Ok, I'm going to start showing my code:

import requests
import json
import csv
import pandas as pd

with open('AcoesURLJsonCompleta.csv', newline='') as csvfile:
    urlreader = csv.reader(csvfile, delimiter=',')
    for obj_id in urlreader:

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
jsonData = requests.get(row, headers=headers).json()

mapper = (
    ('Ticker', 'ric'),
    ('Beta', 'beta'),
    ('DY', 'current_dividend_yield_ttm'),
    ('VOL', 'share_volume_3m'),
    ('P/L', 'pe_normalized_annual'),
    ('Cresc5A', 'eps_growth_5y'),
    ('LPA', 'eps_normalized_annual'),
    ('VPA', 'book_value_share_quarterly'),
    ('LAST', 'last')
)

data = {}
for dataKey, jsonDataKey in mapper: 
    d = jsonData.get(jsonDataKey, '') 
    try:
        flt_d = float(d)
    except ValueError:
        d = ''
    finally:
        data[dataKey] = [d]

table = pd.DataFrame(data, columns=['Ticker', 'Beta', 'DY', 'VOL', 'P/L', 'Cresc5A', 'LPA', 'VPA', 'Last'])
table.index = table.index + 1
table.to_csv('CompleteData.csv', sep=',', encoding='utf-8', index=False)
print(table)

Ok so lets get started with:

  1. My first loop for rows in Urls is correct? I want to loop through my Urls stored in my CSV file, but I don't know if I'm using split and strip correctly.
  2. My json requests are ok?
  3. If any of that jsonData requests return a NaN or Null or doesn't find anything, how should I put that on my code so it will skip to the other URL and append "" (nothing) when this happens?

The output for this entire code is line 25, in <module> Beta = jsonData['beta'] KeyError: 'beta'

Thank you


Solution

  • Updated Code

    I have taken the few lines of URL you have supplied and ran the following code against it and printed out the results. This version uses multiple threads to fetch the URLs and a requests Session. This greatly speeds up processing.

    There is a constant near the top of the code, NUMBER_OF_CONCURRENT_URL_REQUESTS, that determines the number of concurrent URL get requests that will be made. I have tried various numbers from 8 to 30. This is what I have learned (or appears to be true):

    1. Regardless of of the setting for NUMBER_OF_CONCURRENT_URL_REQUESTS, if you run the program twice quickly in succession, you get the same results. It would appear that the server is caching request results for some period of time.
    2. However, if you wait long enough so that the cache does not come into play, you get different results, i.e. different errors as far as data being missing. Why this is I cannot say.
    3. The larger the value of NUMBER_OF_CONCURRENT_URL_REQUESTS, the faster the program runs. There may be some value that is so large that the server might get upset and think that you are trying to commit a denial of service attack. I don't see any reason to make this value larger than 30.
    4. Is there a correlation between the larger the value of NUMBER_OF_CONCURRENT_URL_REQUESTS and the likelihood of having missing data? I cannot say for certain, but that appears to be the case, and it makes no sense to me. You can try different values and see for yourself one way or another.

    The code:

    import csv, requests, pandas as pd
    from decimal import Decimal, DecimalException
    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor
    from functools import partial
    from time import sleep
    
    NUMBER_OF_CONCURRENT_URL_REQUESTS = 8
    
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
    
    def request_getter(session, url):
        ric = url.split('/')[-1] # in case results does not contain 'ric' key
        for t in (0, 1000, 2000, 4000, 4000):
            if t:
                sleep(t)
                print(f"Retrying request '{ric}' ...", flush=True)
            data = session.get(url, headers=headers).json()
            if 'retry' not in data:
                break
        return ric, data
    
    
    mapper = (
        ('Ticker', 'ric'),
        ('Beta', 'beta'),
        ('DY', 'current_dividend_yield_ttm'),
        ('VOL', 'share_volume_3m'),
        ('P/L', 'pe_normalized_annual'),
        ('Cresc5A', 'eps_growth_5y'),
        ('LPA', 'eps_normalized_annual'),
        ('VPA', 'book_value_share_quarterly'),
        ('LAST', 'last')
    )
    
    data = defaultdict(list)
    with open('AcoesURLJsonCompleta.csv', newline='') as csvfile:
        urlreader = csv.reader(csvfile, delimiter=',')
        # set max_workers to # cpu processors you have and use a requests Session for even more perofrmance
        with ThreadPoolExecutor(max_workers=NUMBER_OF_CONCURRENT_URL_REQUESTS) as executor, requests.Session() as session:
            request_getter_with_session = partial(request_getter, session)
            for ric, results in executor.map(request_getter_with_session, (row[0] for row in urlreader)):
                if 'market_data' not in results:
                    print(f"Missing 'market_data' key for request '{ric}'", flush=True)
                    for k, v in results.items():
                        print(f'    {repr(k)} -> {repr(v)}', flush=True)
                    print(flush=True)
                    continue
                market_data = results['market_data']
                if 'ric' not in market_data:
                    # see if any of the mapper keys are present:
                    found = False
                    for _, jsonDataKey in mapper:
                        if jsonDataKey in market_data:
                            found = True
                            break
                    if not found:
                        print(f"Request '{ric}' has nothing recognizable in market_data:", flush=True)
                        for k, v in market_data.items():
                            print(f'    {repr(k)} -> {repr(v)}', flush=True)
                        print(flush=True)
                        continue
                    # We have at least one data value present
                    print(f"Results missing 'ric' key; inferring 'ric' value '{ric}' from request URL.", flush=True)
                    market_data['ric'] = ric
                for dataKey, jsonDataKey in mapper: # for example, 'Ticker', 'ric'
                    d = market_data.get(jsonDataKey)
                    if d is None:
                        print(f"Data missing for request = '{ric}', key = '{jsonDataKey}'", flush=True)
                        d = '' if jsonDataKey == 'ric' else Decimal('NaN')
                    else:
                        try:
                            if jsonDataKey != 'ric': d = Decimal(d)
                        except DecimalException:
                            print(f"Bad value for '{jsonDataKey}': {repr(d)}", flush=True)
                            d = Decimal('NaN') # Decimal class has it's own version
                    data[dataKey].append(d) # add to data
    
    table = pd.DataFrame(data)
    table.index = table.index + 1
    table.to_csv('CompleteData.csv', sep=',', encoding='utf-8', index=False)
    print(table)
    """
    # to read back table:
    table2 = pd.read_csv('CompleteData.csv', sep=',', encoding='utf-8', converters={
        'Ticker': str,
        'Beta': Decimal,
        'DY': Decimal,
        'VOL': Decimal,
        'P/L': Decimal,
        'Cresc5A': Decimal,
        'LPA': Decimal,
        'VPA': Decimal,
        'LAST': Decimal
    })
    print(table2)
    """
    

    Prints:

    Missing 'market_data' key for request CPLE6.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
    Missing 'market_data' key for request EQMA3B.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
    Data missing for ric GNDI3.sa, key beta
    Data missing for ric GNDI3.sa, key current_dividend_yield_ttm
    Data missing for ric GNDI3.sa, key share_volume_3m
    Data missing for ric GNDI3.sa, key pe_normalized_annual
    Data missing for ric GNDI3.sa, key eps_growth_5y
    Data missing for ric GNDI3.sa, key eps_normalized_annual
    Data missing for ric GNDI3.sa, key book_value_share_quarterly
    Missing 'market_data' key for request MDNE3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
    Missing 'market_data' key for request MMXM11.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
    Missing 'market_data' key for request PCAR3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
    Results missing ric key; inferring ric value from request URL.
    Data missing for ric RAIL3.sa, key last
    Results missing ric key; inferring ric value from request URL.
    Data missing for ric SANB4.sa, key last
    Missing 'market_data' key for request TIMP3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
    Missing 'market_data' key for request VIVT3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
           Ticker     Beta        DY        VOL           P/L       Cresc5A       LPA       VPA       LAST
    1    AALR3.sa  1.04339   0.80591   11.00223      26.44449  -99999.99000   0.39668  10.83966  10.490000
    2    ABCB4.sa  1.20526   7.34780   18.61900       5.78866       5.42894   2.46862  18.87782  14.290000
    3    ABEV3.sa  0.46311   4.32628  688.21043      15.04597      -0.71223   0.75369   3.89563  11.340000
    4    ADHM3.sa  1.69780   0.00000    2.36460  -99999.99000  -99999.99000  -0.65331  -2.61497   2.480000
    5    AGRO3.sa  0.35568   4.53332    2.54323      41.17127  -99999.99000   0.49792  17.47838  20.500000
    ..        ...      ...       ...        ...           ...           ...       ...       ...        ...
    255  WEGE3.sa  0.50580   1.02429  165.72543      50.11481      17.06485   0.79697   4.59658  39.940000
    256  WHRL3.sa  0.59263   8.86991    1.24990      12.72584       0.65648   0.50920   2.00868   6.700000
    257  WHRL4.sa  0.59263   8.86991    1.24990      12.72584       0.65648   0.50920   2.00868   6.480000
    258  WIZS3.sa  0.76719  12.18673   19.00407       6.67135      21.23109   1.36704   1.16978   9.120000
    259  YDUQ3.sa  1.42218   1.68099   94.00410      13.83419       9.13751   2.19384  10.31845  30.350000
    
    [259 rows x 9 columns]
    

    Next run:

    Missing 'market_data' key for request CPLE6.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
    Missing 'market_data' key for request EQMA3B.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
    Missing 'market_data' key for request MDNE3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
    Missing 'market_data' key for request MMXM11.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
    Missing 'market_data' key for request PCAR3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
    Missing 'market_data' key for request TIMP3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
    Missing 'market_data' key for request VIVT3.sa
    status -> {}
    message -> service returned code:
    rcom_service_message -> None
    
           Ticker     Beta        DY        VOL           P/L       Cresc5A       LPA       VPA       LAST
    1    AALR3.sa  1.04339   0.80591   11.00223      26.44449  -99999.99000   0.39668  10.83966  10.490000
    2    ABCB4.sa  1.20526   7.34780   18.61900       5.78866       5.42894   2.46862  18.87782  14.290000
    3    ABEV3.sa  0.46311   4.32628  688.21043      15.04597      -0.71223   0.75369   3.89563  11.340000
    4    ADHM3.sa  1.69780   0.00000    2.36460  -99999.99000  -99999.99000  -0.65331  -2.61497   2.480000
    5    AGRO3.sa  0.35568   4.53332    2.54323      41.17127  -99999.99000   0.49792  17.47838  20.500000
    ..        ...      ...       ...        ...           ...           ...       ...       ...        ...
    255  WEGE3.sa  0.50580   1.02429  165.72543      50.11481      17.06485   0.79697   4.59658  39.940000
    256  WHRL3.sa  0.59263   8.86991    1.24990      12.72584       0.65648   0.50920   2.00868   6.700000
    257  WHRL4.sa  0.59263   8.86991    1.24990      12.72584       0.65648   0.50920   2.00868   6.480000
    258  WIZS3.sa  0.76719  12.18673   19.00407       6.67135      21.23109   1.36704   1.16978   9.120000
    259  YDUQ3.sa  1.42218   1.68099   94.00410      13.83419       9.13751   2.19384  10.31845  30.350000
    
    [259 rows x 9 columns]
    

    Discussion

    The code has been made more complex by using threads and a requests Session object, but the complexity is necessary to greatly reduce the running time of the program.

    To understand the code, you need to understand the ThreadPoolExecutor, the map function (ThreadPoolExcecutor.map method is a variation on this that assigns a thread to execute the function call) and functools.partial, which is required since map expects that its function argument is a function that takes a single argument but we need to call request_getter with two arguments, a requests Session object, which never varies, and a URL. partial allows us to transform a function that takes two arguments into a function that takes one argument with the other argument automatically provided. For example:

    def foo(x, y):
        return x + y
    
    def foo7(y):
        return partial(foo, 7) # the first argument to foo now will always be 7
    
    foo7(9) # equivalent to foo(7, 9)
    

    To read back csv file:

    from decimal import Decimal
    import pandas as pd
    
    table = pd.read_csv('CompleteData.csv', sep=',', encoding='utf-8', converters={
        'Ticker': str,
        'Beta': Decimal,
        'DY': Decimal,
        'VOL': Decimal,
        'P/L': Decimal,
        'Cresc5A': Decimal,
        'LPA': Decimal,
        'VPA': Decimal,
        'LAST': Decimal    
    })