Tags: json, asynchronous, concurrency, io

Efficient JSON (de)serialization from/to millions of small files


I have a list containing millions of small records as dicts. Instead of serialising the entire thing to a single file as JSON, I would like to write each record to a separate file. Later I need to reconstitute the list from JSON deserialised from the files.

My goal isn't really minimising I/O so much as finding a general strategy for serialising individual collection elements to separate files concurrently or asynchronously. What's the most efficient way to accomplish this in Python 3.x or a similar high-level language?


Solution

  • For those looking for a modern Python-based solution supporting async/await, I found this neat package which does exactly what I'm looking for: https://pypi.org/project/aiofiles/. Specifically, I can do

    import aiofiles, json
    from typing import AsyncIterator, Iterable

    async def json_reader(files: Iterable) -> AsyncIterator[dict]:
        """A generator that reads and parses JSON from a list of files asynchronously."""
        for file in files:                        # the file list itself is an ordinary iterable
            async with aiofiles.open(file) as f:
                data = await f.read()             # read the whole file as a single string
                yield json.loads(data)
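
    For completeness, here is a minimal sketch of how the writing side and the consumption of that generator could look with the same package. The helper names (write_record, write_all, read_all), the record_{i}.json filename scheme, and the semaphore limit are my own illustrative assumptions, not part of aiofiles or the original answer:

    import asyncio, json
    from pathlib import Path
    from typing import Iterable

    import aiofiles

    async def write_record(path: Path, record: dict) -> None:
        # Serialise a single record to its own JSON file.
        async with aiofiles.open(path, "w") as f:
            await f.write(json.dumps(record))

    async def write_all(records: Iterable, out_dir: Path, limit: int = 1000) -> None:
        # Write every record concurrently, one file per record; the semaphore
        # keeps the number of simultaneously open files bounded.
        out_dir.mkdir(parents=True, exist_ok=True)
        sem = asyncio.Semaphore(limit)

        async def bounded_write(i: int, rec: dict) -> None:
            async with sem:
                await write_record(out_dir / f"record_{i}.json", rec)

        await asyncio.gather(*(bounded_write(i, rec) for i, rec in enumerate(records)))

    async def read_all(files: Iterable) -> list:
        # Reconstitute the list by draining the json_reader generator above.
        return [record async for record in json_reader(files)]

    # Example usage (hypothetical data and paths):
    # records = [{"id": i, "value": i * i} for i in range(10_000)]
    # asyncio.run(write_all(records, Path("records")))
    # data = asyncio.run(read_all(sorted(Path("records").glob("*.json"))))

    Note that even with the semaphore, gathering one coroutine per record still materialises millions of task objects for very large lists; in that case a batching or worker-queue pattern would keep memory in check.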