Tags: python, multithreading, python-multithreading, concurrent.futures

Python multithreading for file reading results in slower performance: How to optimize?


I am learning concurrency in Python and I have noticed that the threading module actually makes my code slower. My code is a simple parser: it reads HTML files from a local directory, extracts a few fields, and writes them as JSON files to another directory.

I was expecting a speedup, but the code got slower. I tested with small batches of HTML files (50, 200, 1000) and with large ones (around 30k), and in every case the threaded version was slower. For example, with 1000 HTML files the run takes ~2.9 seconds without threading and ~4 seconds with threading.

I also tried concurrent.futures.ThreadPoolExecutor, but it gives the same slower results.

I know about the GIL, but I thought that I/O-bound tasks are supposed to benefit from multithreading.

Here is my code:

import json
import re
import time
from pathlib import Path
import threading


def get_json_data(body: str) -> re.Match[str] | None:
    return re.search(
        r'(?<=json_data">)(.*?)(?=</script>)', body
    )


def parse_html_file(file_path: Path) -> dict:
    with open(file_path, "r") as file:
        html_content = file.read()
        match = get_json_data(html_content)
        if not match:
            return {}

        next_data = match.group(1)
        json_data = json.loads(next_data)

        data1 = json_data.get("data1")
        data2 = json_data.get("data2")
        data3 = json_data.get("data3")
        data4 = json_data.get("data4")
        data5 = json_data.get("data5")

        parsed_fields = {
            "data1": data1,
            "data2": data2,
            "data3": data3,
            "data4": data4,
            "data5": data5
        }

        return parsed_fields


def save_parsed_fields(file_path: Path, parsed_fields: dict, output_dir: Path) -> None:
    output_filename = f"parsed_{file_path.stem}.json"
    output_path = output_dir / output_filename

    with open(output_path, "w") as output_file:
        json.dump(parsed_fields, output_file)

    print(f"Parsed {file_path.name} and saved the results to {output_path}")


def process_html_file(file_path: Path, parsed_dir: Path) -> None:
    parsed_fields = parse_html_file(file_path)
    save_parsed_fields(file_path, parsed_fields, parsed_dir)


def process_html_files(source_dir: Path, parsed_dir: Path) -> None:
    parsed_dir.mkdir(parents=True, exist_ok=True)

    threads = []
    for file_path in source_dir.glob("*.html"):
        thread = threading.Thread(target=process_html_file, args=(file_path, parsed_dir))
        thread.start()
        threads.append(thread)

    # Wait for all threads to finish
    for thread in threads:
        thread.join()


def main():
    base_path = "/home/my_pc/data"
    source_dir = Path(f"{base_path}/html_sample")
    parsed_dir = Path(f"{base_path}/parsed_sample")

    start_time = time.time()

    process_html_files(source_dir, parsed_dir)

    end_time = time.time()
    duration = end_time - start_time
    print(f"Application took {duration:.2f} seconds to complete.")


if __name__ == "__main__":
    main()

I know about asyncio, but I first want to properly test the multithreading approaches so I can pick the one that suits me best.

As mentioned, I also tried concurrent.futures; the code is almost the same, except that process_html_files contains these lines:

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    # Iterate over the HTML files in the source directory
    for file_path in source_dir.glob("*.html"):
        executor.submit(process_html_file, file_path, parsed_dir)

Are there any mistakes in my code? How can I better optimize it with multithreading (setting asyncio aside)?


Solution

  • First, you can improve parse_html_file by closing the input file as soon as you have read all the data (note that the rest of the code is no longer inside the with context manager):

    def parse_html_file(file_path: Path) -> dict:
        with open(file_path, "r") as file:
            html_content = file.read()
            
        match = get_json_data(html_content)
        if not match:
            return {}
    
        next_data = match.group(1)
        json_data = json.loads(next_data)
    
        data1 = json_data.get("data1")
        data2 = json_data.get("data2")
        data3 = json_data.get("data3")
        data4 = json_data.get("data4")
        data5 = json_data.get("data5")
    
        parsed_fields = {
            "data1": data1,
            "data2": data2,
            "data3": data3,
            "data4": data4,
            "data5": data5
        }
    
        return parsed_fields
    

    Second, your threading-based solution is not recommended, as it creates as many threads as there are input files. That can be a lot and could result in slowdowns and high memory usage. A thread/process pool is the better fit here. Make sure max_workers is set to a value greater than 1 (for instance, the number of CPU cores of your machine), as in the sketch below.
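
    For example, a pool-based process_html_files could look like the following minimal sketch. It reuses process_html_file from your code; the max_workers default (the CPU count) is only an assumption you should tune for your workload:

    import os
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path


    def process_html_files(source_dir: Path, parsed_dir: Path, max_workers: int | None = None) -> None:
        parsed_dir.mkdir(parents=True, exist_ok=True)
        # Bound the number of worker threads instead of starting one thread per file
        max_workers = max_workers or os.cpu_count() or 4

        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [
                executor.submit(process_html_file, file_path, parsed_dir)
                for file_path in source_dir.glob("*.html")
            ]
            # Re-raise any exception that occurred inside a worker
            for future in futures:
                future.result()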

    Last, multithreading and multiprocessing are unlikely to speed up your program if the concurrently executed functions are already very fast; in that case you only pay the overhead of managing the concurrency. You should first time the individual functions precisely and identify the slow ones (is it get_json_data, the rest of parse_html_file, or maybe the file saving?). A minimal timing sketch follows.
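
    As a rough way to see where the time goes, you could accumulate the time spent in parsing versus saving across all files. This is a minimal sketch reusing parse_html_file and save_parsed_fields from your code; the function name process_html_files_timed is just for illustration:

    import time
    from pathlib import Path


    def process_html_files_timed(source_dir: Path, parsed_dir: Path) -> None:
        parsed_dir.mkdir(parents=True, exist_ok=True)
        parse_time = save_time = 0.0

        for file_path in source_dir.glob("*.html"):
            t0 = time.perf_counter()
            parsed_fields = parse_html_file(file_path)
            t1 = time.perf_counter()
            save_parsed_fields(file_path, parsed_fields, parsed_dir)
            t2 = time.perf_counter()

            # Accumulate per-stage durations to compare parsing vs. saving
            parse_time += t1 - t0
            save_time += t2 - t1

        print(f"parsing: {parse_time:.2f}s, saving: {save_time:.2f}s")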