Tags: python, scrapy, python-asyncio, pipeline

How to stop and exit an async script called by exec() function?


I have this file, let's call it bs4_scraper.py. For context:

  • The scrap function is just an async function that makes asynchronous requests to the website.
  • The get_pids_from_file_generator function reads a .txt file, adds each line (the pid) to a generator, and returns it (sketched below).
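
A simplified sketch of what these two helpers do (not my exact code - the aiohttp client, the example URL, and the file name are just placeholders):

from asyncio import Semaphore

import aiohttp  # placeholder HTTP client, only for illustration


async def scrap(pid, header, limit: Semaphore):
    # one asynchronous request per pid, limited by the semaphore
    async with limit:
        async with aiohttp.ClientSession(headers=header) as session:
            async with session.get(f"https://example.com/items/{pid}") as response:
                return pid, response.status, await response.text()


def get_pids_from_file_generator():
    # read a .txt file and yield each line (the pid)
    with open("pid_errors.txt", encoding="utf-8") as f:
        for line in f:
            yield line.strip()

The relevant part of bs4_scraper.py itself: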
async def bs4_scraper():
    limit = Semaphore(8)
    tasks = []
    pids = get_pids_from_file_generator()
    for pid in pids:
        task = create_task(scrap(pid, fake_header(), limit))
        tasks.append(task)
    result = await gather(*tasks)
    return result


if __name__ == "__main__":
    try:
        run(bs4_scraper())
    except Exception as e:
        logger.error(e)

When I run this file from the terminal with python bs4_scraper.py, the function runs and exits gracefully when all requests are done. No problems up to this point (I think).

Now I have this separate file, which is a Scrapy pipeline that runs at the end of the scraping process:

class WritePidErrorsPipeline:
    def close_spider(self, spider):
        pid_errors_file = generate_pid_errors_file()
        pg = PostgresDB()
        non_inserted_ids = pg.select_non_inserted_ids(pid_errors_file)
        if non_inserted_ids:
            self.insertion_errors_file(non_inserted_ids)
            bs4_file = os.path.abspath("bs4/bs4_scraper.py")
            exec(open(bs4_file).read()) # THE PROBLEM IS RIGHT HERE
        else:
            logger.info("[SUCCESS]: There are no items missing")

    def insertion_errors_file(
        self,
        non_inserted_ids: List[Tuple[str]],
        output_file: str = "insertion_errors.log",
    ) -> str:
        with open(output_file, "w", encoding="utf-8") as f:
            for non_inserted_id in non_inserted_ids:
                f.write(f"{non_inserted_id[0]}\n")
        return output_file

The problem occurs at the line exec(open(bs4_file).read()). The file is called and the function runs properly, but when it is done it does not exit; it keeps running after the last successful request. It looks like a zombie process, and I have no idea why this happens.

How do I improve this code to run as expected?

PS: sorry for any English mistakes


Solution

  • Are you sure the file actually runs and hangs after it finishes? Because an obvious problem there is the guard if __name__ == "__main__": at the end of your file: this is a code pattern meant to ensure the guarded part will only run when that file - the file containing the line if __name__ == "__main__": - is the main file called by Python.

    When running Scrapy, IIRC, the main file is one of Scrapy's own scripts, which in turn imports your file containing the Pipeline: at that point, the variable __name__ won't contain "__main__" anymore - rather, it will be equal to the module name, sans .py. The outer __name__ variable will simply propagate into the exec body if you don't provide a custom globals dict as the second parameter - so, just by looking at your code, what can be said is that the bs4_scraper function will never be called.

    The fact that you truncated your files, throwing away the import statements, makes it HARD to give you a definitive answer - I suppose in the pipeline file (or in the script) you have something like from asyncio import run. Please - these are not optional; they are necessary for anyone reviewing your code to know what is going on.

    Either way, you have such an import, or the code would not work in the circumstances you describe - so, if the problem is what I had to guess here, you could fix it by setting the __name__ variable to "__main__" inside the exec call (a sketch of that is shown after the code below) - but then we get to the other side: WHY this exec approach at all? You are running a Python program, reading a Python file, and issuing a statement to compile it from text so that the code can be run - when you could just import the file and call a function.

    So, you can fix your code just by making it behave like a program, instead of forcing one file to be read as "text" and exec'ed:

    import sys
    from pathlib import Path
    import asyncio

    class WritePidErrorsPipeline:
        def close_spider(self, spider):
            ...
            if non_inserted_ids:
                self.insertion_errors_file(non_inserted_ids)
                # make the bs4 directory importable, then import the module normally
                bs4_dir = str(Path("bs4").absolute())
                if bs4_dir not in sys.path:
                    sys.path.insert(0, bs4_dir)
                import bs4_scraper
                result = asyncio.run(bs4_scraper.bs4_scraper())
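
    If you really want to keep the exec() approach instead, the fix hinted at above would look roughly like this - just a sketch, passing a globals dict whose __name__ is "__main__" so that the guard inside bs4_scraper.py actually fires:

    import os

    bs4_file = os.path.abspath("bs4/bs4_scraper.py")
    with open(bs4_file, encoding="utf-8") as f:
        source = f.read()
    # run the file's code with __name__ set to "__main__", as if it were the main script
    exec(compile(source, bs4_file, "exec"), {"__name__": "__main__"})

    Either way, the import-based version is simpler: Python compiles and caches the module for you, and you call the coroutine directly instead of relying on the file's __main__ block.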