Tags: python-3.x, web-scraping, python-multithreading

How to run a threaded function that returns a variable?


Working with Python 3.6, I want to create a function that continuously scrapes dynamic/changing data from a webpage while the rest of the script executes, with the rest of the script able to reference the data the function returns.

I know this is likely a threading task, but I'm not very knowledgeable in threading yet. In pseudo-code, I imagine it would look something like this:

def continuous_scraper():
    # Pull data from webpage
    scraped_table = pd.read_html(url)
    return scraped_table

# start the continuous scraper function here, to run either indefinitely, or preferably stop after a predefined amount of time
scraped_table = thread(continuous_scraper)

# the rest of the script is run here, making use of the updating "scraped_table"
while True:
    print(scraped_table["Col_1"].iloc[0])

Solution

  • Here is a fairly simple example using some stock market page that seems to update every couple of seconds.

    import threading, time
    
    import pandas as pd
    
    # A lock is used to ensure only one thread reads or writes the variable at any one time
    scraped_table_lock = threading.Lock()
    
    # Initially set to None so we know when its value has changed
    scraped_table = None
    
    # This bad-boy will be called only once in a separate thread
    def continuous_scraper():
        # Tell Python this is a global variable, so it rebinds scraped_table 
        # instead of creating a local variable that is also named scraped_table
        global scraped_table
        url = r"https://tradingeconomics.com/australia/stock-market"
        while True:
            # Pull data from webpage
            result = pd.read_html(url, match="Dow Jones")[0]
            
            # Acquire the lock to ensure thread-safety, then assign the new result
            # This is done after read_html returns, so the lock isn't held while the slow network request runs
            with scraped_table_lock:
                scraped_table = result
            
            # You don't wanna flog the server, so wait 2 seconds after each 
            # response before sending another request
            time.sleep(2)
    
    # Make the thread daemonic, so the thread doesn't continue to run once the 
    # main script and any other non-daemonic threads have ended
    scraper_thread = threading.Thread(target=continuous_scraper, daemon=True)
    
    # start the continuous scraper function here, to run either indefinitely, or 
    # preferably stop after a predefined amount of time
    scraper_thread.start()
    
    # the rest of the script is run here, making use of the updating "scraped_table"
    for _ in range(100):
        print("Time:", time.time())
        
        # Acquire the lock to ensure thread-safety
        with scraped_table_lock:
            # Check if it has been changed from the default value of None
            if scraped_table is not None:
                print("     ", scraped_table)
            else:
                print("scraped_table is None")
        
        # You probably don't wanna flog your stdout, either, dawg!
        time.sleep(0.5)
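
    The comment above mentions preferring to stop after a predefined amount of time. One way to do that (a sketch, not part of the answer's code) is to swap the bare while True for a threading.Event that the main thread sets when it wants the scraper to exit; the 60-second run time below is just an illustrative value:

    import threading, time
    
    import pandas as pd
    
    scraped_table_lock = threading.Lock()
    scraped_table = None
    
    # Set this event from the main thread to ask the scraper to stop
    stop_event = threading.Event()
    
    def continuous_scraper():
        global scraped_table
        url = r"https://tradingeconomics.com/australia/stock-market"
        # Loop until the main thread sets the event
        while not stop_event.is_set():
            result = pd.read_html(url, match="Dow Jones")[0]
            with scraped_table_lock:
                scraped_table = result
            # wait() doubles as the 2-second pause between requests, but
            # returns early the moment the event is set, so shutdown is prompt
            stop_event.wait(2)
    
    scraper_thread = threading.Thread(target=continuous_scraper, daemon=True)
    scraper_thread.start()
    
    # ... the rest of the script runs here ...
    
    # Stop the scraper after a predefined amount of time
    time.sleep(60)
    stop_event.set()
    scraper_thread.join()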
    

    Be sure to read about multithreaded programming and thread safety. It's easy to make mistakes, and when there is a bug, it often manifests only on rare and seemingly random occasions, making it difficult to debug.
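
    As a concrete illustration of why the lock matters (a contrived sketch, not part of the answer above): several threads incrementing a shared counter without a lock will usually lose updates, because counter += 1 is a read-modify-write that the threads can interleave:

    import threading
    
    counter = 0
    
    def unsafe_increment():
        global counter
        for _ in range(100_000):
            # Read-modify-write: another thread can run between the read
            # and the write, so some increments get silently lost
            counter += 1
    
    threads = [threading.Thread(target=unsafe_increment) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    
    # With a lock around the increment this would always print 400000;
    # without one it frequently prints less
    print(counter)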